NLP Literature Survey with focus on Computerized Deception Detection

Natural Language Processing

Literature Survey

Overview of Computerized Deception Detection

Yoav Francis, IDC Herzelia

11/06/2013

Part 1 - Topics of Interest

1. Sentiment Analysis

2. Sign Language (capture and recognition)

3. Computational Creative Naming

4. Computerized Deception Detection

5. NLP Approaches for Multiword Expressions

6. Answer Extraction

7. Natural Language Generation

8. Automatic Text Summarization

9. NLP-based Bibliometrics

10. Natural Language User Interfaces for Relational Databases

11. Truecasing - Restoring Case Information for badly/non-cased text

12. The Web as a Corpus

Part 2 - Extension on 4 Selected Topics

1 - Sign Language (capture and recognition)

American Sign Language (ASL) is the primary means of communication for around 1.5 million deaf people in the United States [2]. It is a visual-gestural language that uses upper-body gestures. Sign language has no written form, so corpora currently take the form of videos [13], and the NLP field may need to adapt to this kind of data. There have been a few attempts to create sign language corpora [13], but they have yet to be studied from a linguistic / NLP perspective. New tools, and adaptations of existing ones, need to be developed to face this challenge, with regard to timing, spatial reference, inflection and new methods of unified motion capture for use in sign language analysis.

2 - Truecasing

Truecasing is the problem of determining the proper capitalization of a sentence or document when it is uncapitalized or wrongly capitalized. It is relevant mainly for English and for any language whose script distinguishes between lower-case and upper-case letters; the problem is irrelevant for languages that are not written in the Latin, Cyrillic, Greek or Armenian alphabets. Besides improving readability, truecasing aids many tasks, such as entity recognition, translation and content extraction. The main aim of the process is to restore case information to raw text [11, 14].
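
To make the idea concrete, here is a minimal sketch of a unigram truecaser in Python. It is a deliberate simplification for illustration, not the method of [11]: it only memorizes the most frequent casing of each word seen in a small cased corpus. The function names and the three-sentence corpus are invented for this example.

```python
from collections import Counter, defaultdict


def train_truecaser(cased_sentences):
    """Count how often each surface form occurs for every lowercased word."""
    counts = defaultdict(Counter)
    for sentence in cased_sentences:
        tokens = sentence.split()
        # Skip the first token: its capital letter is positional, not lexical.
        for token in tokens[1:]:
            counts[token.lower()][token] += 1
    return counts


def truecase(text, counts):
    """Restore the most frequent known casing; capitalize the first token."""
    restored = []
    for token in text.split():
        forms = counts.get(token.lower())
        # Unknown words are left in lower case.
        restored.append(forms.most_common(1)[0][0] if forms else token.lower())
    if restored:
        restored[0] = restored[0][0].upper() + restored[0][1:]
    return " ".join(restored)


# Invented toy corpus, used only to demonstrate the mechanics.
corpus = [
    "Yesterday we flew from Tel-Aviv to Chicago",
    "The hotel in Chicago was close to the lake",
    "We left Chicago on Monday",
]
model = train_truecaser(corpus)
print(truecase("the flight to chicago left on monday", model))
# prints: The flight to Chicago left on Monday
```

A unigram model like this cannot resolve words whose casing depends on context (for example "apple" the fruit versus "Apple" the company); handling those cases requires looking at neighbouring words.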

3 - Natural Language User Interfaces for Relational Databases

A natural language interface for a database allows the user to type in natural language queries (such as "what buses leave at 16:00 from Tel-Aviv?"), which are then transformed into an SQL query. This translation phase poses an NLP challenge: it requires morphological and syntactic analysis, followed by semantic analysis, in order to transform the user's question into a few intermediate-language representations that correspond to the possible readings of the question, before choosing the one that will be transformed into an SQL query. This architecture is formally known as a "Natural Language Interface for Databases" (NLIDB). A popular implementation of such an NLIDB is called Edite and is widely available [10, 15].
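
The question-to-intermediate-form-to-SQL flow can be sketched in a few lines of Python. This is only a toy under strong assumptions: a single hand-written regular expression stands in for the morphological, syntactic and semantic analysis, and the `buses` table, its columns and its rows are all invented for the example. It does not reflect how Edite or the systems in [10, 15] are built.

```python
import re
import sqlite3

# Hypothetical template: one question pattern mapped to one intermediate form.
PATTERN = re.compile(
    r"what buses leave (?:at|on) (\d{1,2}:\d{2}) from ([\w-]+)\??",
    re.IGNORECASE,
)


def parse(question):
    """Map a question to an intermediate representation, or None if unknown."""
    match = PATTERN.fullmatch(question.strip())
    if match is None:
        return None
    return {
        "table": "buses",
        "select": "line",
        "where": {"departure": match.group(1), "origin": match.group(2)},
    }


def to_sql(ir):
    """Turn the intermediate representation into a parameterized SQL query."""
    conditions = " AND ".join(f"{column} = ?" for column in ir["where"])
    sql = f"SELECT {ir['select']} FROM {ir['table']} WHERE {conditions}"
    return sql, list(ir["where"].values())


# Invented in-memory database, used only to run the generated query.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE buses (line TEXT, origin TEXT, departure TEXT)")
db.executemany(
    "INSERT INTO buses VALUES (?, ?, ?)",
    [("480", "Tel-Aviv", "16:00"), ("910", "Haifa", "16:00"), ("405", "Tel-Aviv", "17:30")],
)

sql, params = to_sql(parse("what buses leave at 16:00 from Tel-Aviv?"))
rows = db.execute(sql, params).fetchall()
print(sql)
print(rows)
```

Passing the user-supplied values as query parameters, rather than pasting them into the SQL string, keeps the translation step from becoming an injection risk.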

4 - Computerized Deception Detection

Computerized deception detection is the process of assessing the "authenticity" and "truthfulness" of a given text (for example, detecting someone who writes false reviews). Methods for doing so can be purely lexical (in the sense that they simply use dictionary word counts), or can use POS tagging and n-grams for a higher rate of success. Previous insights include, for example, that deceivers use verbs and pronouns more often. More complex approaches, which yield a better detection rate, examine the syntactic stylometry of the text by using CFG parse trees. One use for this kind of detection is identifying fake reviews ("opinion spam") [4, 16, 17].

Part 3 - 2-Page Survey - Computerized Deception Detection

Deception detection, or deceptive opinion detection, is the task of deciding whether a given text that carries some opinion is deceptive (or "false"). To clarify what this means, take, for instance, a hotel review site: an "adversary" may post a review that was deliberately written to sound authentic and to deceive the reader into believing that the review is truthful. The "deception" referred to in this summary is that of user reviews / opinions.

The task at hand is therefore, given some text (or review), to decide whether the review is truthful. The need for this is rather clear: preventing deceptive opinion spam [17] in media where reviews or opinions are written or posted. Nowadays, with crowdsourcing platforms such as Amazon Mechanical Turk available, deceptive opinions can easily be generated and can bias a user for the better (or for the worse).

The task is quite a challenge, since we want as few false positives as possible, and the task itself involves many aspects of the field of natural language processing.

As with many other natural-language tasks, this task also requires some data, for example from review websites (in [4], for example, data from TripAdvisor was used). We need some reviews that are guaranteed to be truthful and some that are guaranteed to be deceptive, that is, a gold dataset against which we can compare our evaluation. It is worth noting that even without an applicable gold dataset, there exists a heuristic approach for evaluation [19].

In turning to evaluate deceptive reviews, we shall consider the case in which such a gold set exists (in [17], the deceptive part of the gold set was generated using Amazon Mechanical Turk). The truthful part of the gold set, that is, truthful reviews, can be collected from authenticated and reputable users (this was also done in [17]). Such datasets, which can be domain-specific, are publicly available [20].

Before attempting a machine-based evaluation, it is interesting to inspect the performance of human evaluation. In [17] it is reported that human detection of deceit is poor: according to their test, humans reached a maximum average accuracy of 61% in telling truth from deceit, and the agreement between different people's decisions on a given review was close to chance.

As for an automated, NLP-based approach to the issue, there exist several approaches:

One approach analyzes the frequency of POS tags as a basis for deciding whether a given text is deceitful or truthful. In the analysis of this method in [17], it was shown to have the lowest accuracy of all the machine-based methods.

A second approach is based on psycholinguistics, in order to detect personality traits. A tool for this is widely available (LIWC [21]). It is essentially a somewhat more socially-oriented alternative to the previous POS tagging mechanism. Analysis of this method yielded slightly better results than the POS-based one.

A third approach introduces n-grams into the model and treats the task as text categorization. Using this type of classification dramatically increased the success of detection and yielded an accuracy of ~88%. This indicates that the context of words in the sentence (that is, n-gram-based detection) is a major contributor when detecting deceptive opinions.
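
As an illustration of n-gram-based text categorization, the following Python sketch extracts unigrams and bigrams and feeds them to a small multinomial Naive Bayes classifier with add-one smoothing. It is a minimal sketch, not the configuration evaluated in [17]: the classifier is a simple stand-in chosen for brevity, and the four training sentences and their labels are invented for the example.

```python
import math
from collections import Counter


def ngrams(text, n_max=2):
    """Return all word n-grams of length 1..n_max as tuples."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]


class NaiveBayes:
    """Multinomial Naive Bayes over n-gram counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.counts = {label: Counter() for label in set(labels)}
        self.docs = Counter(labels)
        for text, label in zip(texts, labels):
            self.counts[label].update(ngrams(text))
        self.vocab = set().union(*self.counts.values())
        return self

    def predict(self, text):
        def log_score(label):
            total = sum(self.counts[label].values())
            prior = math.log(self.docs[label] / sum(self.docs.values()))
            # N-grams never seen in training are ignored.
            return prior + sum(
                math.log((self.counts[label][g] + 1) / (total + len(self.vocab)))
                for g in ngrams(text) if g in self.vocab)
        return max(sorted(self.counts), key=log_score)


# Invented toy data: far too small for real use, shown only for the mechanics.
train_texts = [
    "my husband and i loved this luxury hotel",
    "my family and i had an amazing experience",
    "the room was small and the bathroom floor was wet",
    "the location is two blocks from the station",
]
train_labels = ["deceptive", "deceptive", "truthful", "truthful"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("my husband and i had an amazing stay"))
print(model.predict("the bathroom was small"))
```

Bigrams such as ("my", "husband") carry word context that single-word counts lose, which is the intuition behind the improvement described above.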

Finally, a recently published article [4] suggested an even more novel approach: taking into account the "syntactic stylometry" of the text (that is, evaluating the similarity of different opinions based on the "style of writing"). According to [22], similar work on syntactic stylometry has been done for authorship attribution, and even for age attribution for blogs [23].

This more novel method can be achieved with techniques based on Probabilistic Context-Free Grammar (PCFG) parse trees, as this is the most prominent technique for analysis of syntactic stylometry [17, 22, 23]. The previously mentioned methods are based only on shallow lexico-syntactic features. In [4], analysis of this method yielded very strong statistical evidence of deep syntactic patterns that allow us to detect deceitful texts with very high accuracy (91.2%).
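
To show what a "deep syntactic" feature could look like, the sketch below turns a parse tree into counts of its production rules, which can then be used as classifier features alongside n-grams. It is an illustrative assumption, not the exact feature set of [4]: the nested-tuple tree representation and the `productions` function are invented here, and the tree is written by hand rather than produced by a PCFG parser.

```python
from collections import Counter


def productions(tree, lexicalized=False):
    """Yield 'LHS -> RHS' rules from a tree given as (label, child, ...)."""
    label, *children = tree
    if all(isinstance(child, str) for child in children):
        # Pre-terminal node such as ("NN", "hotel"): emit only if lexicalized.
        if lexicalized:
            yield f"{label} -> {' '.join(children)}"
        return
    yield f"{label} -> {' '.join(child[0] for child in children)}"
    for child in children:
        yield from productions(child, lexicalized)


# Hand-written parse of "the rooms are modern" (Penn Treebank style labels).
tree = ("S",
        ("NP", ("DT", "the"), ("NNS", "rooms")),
        ("VP", ("VBP", "are"), ("ADJP", ("JJ", "modern"))))

features = Counter(productions(tree))
for rule, count in features.items():
    print(count, rule)
```

Because the unlexicalized rules ignore the words themselves, two reviews with entirely different vocabulary can still share the same syntactic profile.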

It is also worth noting that in all the machine models suggested above, the precision and recall values were very close to each other, as can be seen in the comparison table in [17].
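
For reference, precision and recall for the "deceptive" class can be computed as in the Python sketch below. The gold and predicted labels are invented for the example and are not results from [17].

```python
def precision_recall(gold, predicted, positive="deceptive"):
    """Return (precision, recall, F1) for the class named by `positive`."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Invented labels: 4 deceptive and 4 truthful reviews, two of them misjudged.
gold = ["deceptive"] * 4 + ["truthful"] * 4
predicted = ["deceptive", "deceptive", "deceptive", "truthful",
             "deceptive", "truthful", "truthful", "truthful"]

print(precision_recall(gold, predicted))
# prints: (0.75, 0.75, 0.75)
```

Precision and recall coincide here because the classifier makes as many false-positive errors as false-negative ones, which is what similar precision and recall values indicate in general.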

Further research has also been done on duplicate opinion detection (in the sense that the same writer wrote duplicate reviews, but wrote each in a "different way"), and on specific deception detection techniques that can be model-specific [18].

As a quick test for the reader, and to demonstrate the limits of human skill at detecting deceit, have a look at Figure 1 and see if you can tell which review is truthful and which is deceitful (taken from [17]).

Figure 1: Truthful and Deceitful Reviews/Opinions

1. I have stayed at many hotels traveling for both business and pleasure and I can honestly stay that The James is tops. The service at the hotel is first class. The rooms are modern and very comfortable. The location is perfect within walking distance to all of the great sights and restaurants. Highly recommend to both business travelers and couples.

2. My husband and I stayed at the James Chicago Hotel for our anniversary. This place is fantastic! We knew as soon as we arrived we made the right choice! The rooms are BEAUTIFUL and the staff very attentive and wonderful!! The area of the hotel is great, since I love to shop I couldn't ask for more!! We will definatly be back to Chicago and we will for sure be back to the James Chicago.

Future work obviously includes adapting the above methods to other problem domains, for example reviews of other kinds, or any platform where user feedback and opinion is possible. Deception is a rather prevalent phenomenon [24] in many media where users can express their opinions. Another interesting direction would be to analyze deception and truthfulness on combined data from many different datasets (for example, hotel reviews, movie reviews, product reviews, etc.) and see whether we can come up with a valid deception criterion for text from any of the aforementioned domains, rather than one tied to a specific domain through training on that domain alone.

Personally, and to conclude, I found the deception detection topic and its relation to NLP quite fascinating, and very much enjoyed reading the relevant papers on the subject. It seems like we are "almost there" on creating and streamlining a product that will be able to detect deceptive opinions on the web (or anywhere else).

Part 4 - References

[1] Becky Sue Parton, Sign Language Recognition and Translation: A Multidisciplined Approach From the Field of Artificial Intelligence, Journal of Deaf Studies, 2011

[2] Lu and Huenerfauth, Collecting a Motion-Capture Corpus of American Sign Language for Data-Driven Generation Research, NAACL HLT, 2010

[3] Ozbal and Strapparava, A Computational Approach to the Automation of Creative Naming, ACL 2012

[4] Feng, Banerjee and Choi, Syntactic Stylometry for Deception Detection, ACL 2012

[5] Sag, Baldwin et al., Multiword Expressions: A Pain in the Neck for NLP, Stanford University LinGO Project, 2001

[6] Abney, Collins and Singhal, Answer Extraction, AT&T Shannon Labs, ANLC 2000

[7] Reiter and Dale, Building Natural Language Generation Systems, Cambridge Press, 2000

[8] Hahn and Reimer, Advances in automatic text summarization, MIT Press, 1999

[9] Abu-Jbara, Ezra and Radev, Purpose and Polarity of Citation: Towards NLP-based Bibliometrics, NAACL-HLT 2013

[10] Filipe and Mamede, Databases and Natural Language Interfaces, CSTC Portugal, 2007

[11] Lita, Roukos et al., tRuEcasIng, ACL 2003

[12] Kilgarriff and Grefenstette, Introduction to the Special Issue on the Web as Corpus, ACL 2003

[13] Segouat and Braffort, Toward Categorization of Sign Language Corpora, AFNLP 2009

[14] English Wikipedia, "Truecasing"

[15] Stratica, Kosseim and Desai, NLIDB Templates for Semantic Parsing, Concordia University, Canada

[16] Argamon, Koppel and Avneri, Style-based Text Categorization: What Newspaper Am I Reading?, AAAI 1998

[17] Ott et al., Finding Deceptive Opinion Spam by Any Stretch of the Imagination, ACL 2011

[18] Jindal and Liu, Opinion Spam and Analysis, WSDM 2008

[19] Wu et al., Distortion as a Validation Criterion in the Identification of Suspicious Reviews, SOMA 2010

[20] TripAdvisor Ireland Dataset, http://mlg.ucd.ie/datasets/trip

[21] Linguistic Inquiry and Word Count (LIWC), http://www.liwc.net/

[22] Hollingsworth, Syntactic Stylometry: Using Sentence Structure for Authorship Attribution, University of Georgia, 2012

[23] Jaget Sastry, Blogger Age Attribution Using Syntactic Stylometry, https://bitbucket.org/jagatsastry/

[24] Ott, Cardie and Hancock, Estimating the prevalence of deception in online review communities, WWW 2012
