Special Topics in Text Mining
Manuel Montes y Gómez (http://ccc.inaoep.mx/~mmontesg/)
University of Alabama at Birmingham, Spring 2011
Non-thematic text classification
Special Topics on Information Retrieval
Agenda
• Authorship attribution
  – Description of the task
  – Applications and related tasks
  – Features and methods
• Sentiment classification
  – Description of the task
  – Applications
  – Features and methods
  – Cross-domain sentiment classification
Authorship attribution
Stamatatos, E. 2009. A Survey of Modern Authorship Attribution Methods. Journal of the American Society for Information Science and Technology, 60(3): 538-556.
AA as a classification problem
• In the typical authorship attribution problem, a text of unknown authorship is assigned to one candidate author, given a set of candidate authors for whom text samples of undisputed authorship are available.
• From a machine learning point of view, this can be viewed as a multi-class, single-label text categorization task.
Possible applications of AA? Other related tasks?
At the heart of AA
• Information retrieval methods that make it possible to represent and classify large volumes of text.
• Machine learning algorithms that help handle high-dimensional, sparse data, allowing more expressive representations.
• NLP techniques that analyze text efficiently and provide new measures for representing style (e.g., syntax-based features).
Application areas of AA
• Intelligence
  – Attribution of messages or proclamations to known terrorists
  – Linking different messages by authorship
• Criminal law
  – Identifying writers of harassing messages
  – Verifying the authenticity of suicide notes
• Civil law
  – Copyright disputes
• Computer forensics
  – Identifying the authors of the source code of malicious software
• Literary research
  – Attributing anonymous or disputed literary works to known authors
Related tasks
• Author verification
  – Deciding whether or not a given text was written by a certain author
• Plagiarism detection
  – Finding similarities between two texts
• Author profiling or characterization
  – Extracting information about the age, education, sex, etc. of the author of a given text
• Detection of stylistic inconsistencies
  – Analyzing collaborative writing
  – Detecting plagiarism (intrinsic plagiarism detection)
Features and methods
• As mentioned, the main idea behind AA is that by measuring some textual features we can distinguish between texts written by different authors.
• It is important to have features that quantify the writing style of authors, and to apply methods able to learn from such features.
How to address the AA problem? What features could be used?
Lexical features (1)
• Several different lexical features have been used for AA:
  – Simple measures such as sentence-length and word-length counts
    • Can be applied to any language and any corpus
    • For certain languages (Chinese, German, etc.) word segmentation is not trivial
  – Vocabulary richness and the number of hapax legomena (i.e., words occurring only once)
    • Vocabulary size depends heavily on text length
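These simple lexical measures are easy to compute directly from raw text. The sketch below (plain Python, naive sentence splitting and tokenization; all names are illustrative) derives sentence length, word length, vocabulary richness, and the hapax legomena count:

```python
import re
from collections import Counter

def lexical_measures(text):
    """Simple lexical style measures: average sentence length, average
    word length, type-token ratio, and hapax legomena count.
    (Illustrative sketch: naive sentence split, regex tokenization.)"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-zA-Z']+", text.lower())
    counts = Counter(words)
    return {
        "avg_sentence_len": len(words) / len(sentences),       # words per sentence
        "avg_word_len": sum(map(len, words)) / len(words),     # characters per word
        "type_token_ratio": len(counts) / len(words),          # vocabulary richness
        "hapax_legomena": sum(1 for c in counts.values() if c == 1),
    }

m = lexical_measures("The cat sat. The cat slept soundly.")
```

Note the caveat from the slide: the type-token ratio and hapax count both shrink as texts get longer, so such measures should only be compared across texts of similar length.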
Lexical features (2)
• Traditional bag-of-words text representation
  – Good for topic classification, but does not necessarily capture the writing style of authors
• Function words
  – Used in a largely unconscious manner by authors, and topic-independent
• Subset of the most frequent words
  – Similar problems to bag-of-words
• Word n-grams
  – Not always better than individual word features
  – Dimensionality increases considerably
Character features
• According to this family of measures, a text is viewed as a mere sequence of characters.
• Various character-level measures:
  – Counts of alphabetic characters, digits, uppercase and lowercase characters, letter frequencies, punctuation marks, etc.
• Frequencies of character n-grams
  – Lexical information (e.g., |_in_|, |text|)
  – Contextual information (e.g., |in_t|)
  – Use of punctuation and capitalization
  – Commonly used suffixes (e.g., |ful_|, |ing_|)
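Character n-gram frequencies are straightforward to extract. A minimal sketch, marking spaces with "_" as in the examples above (function name and parameters are illustrative):

```python
from collections import Counter

def char_ngrams(text, n=4):
    """Frequencies of character n-grams; '_' stands for a space,
    following the slide's notation. (Illustrative sketch.)"""
    s = text.replace(" ", "_")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

grams = char_ngrams("in text mining", n=4)
# grams now contains entries such as |in_t| (contextual) and |text| (lexical)
```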
Character features – some issues
• Extracting n-grams is language-independent and requires no special tools.
• Dimensionality is considerably increased in comparison to the word-based representation.
• How long should the n-grams be?
  – A large n better captures lexical and contextual information, but it also captures thematic information and increases dimensionality
  – A small n can represent sub-word (syllable-like) information, but is not adequate for representing contextual information
Syntactic features (1)
• The idea is that authors tend to use similar syntactic patterns unconsciously.
• Syntactic information is considered a more reliable authorial fingerprint than lexical information.
• Disadvantages:
  – Robust and accurate NLP tools are required to perform syntactic analysis of texts
  – The procedure is language-dependent
Syntactic features (2)
• POS tag frequencies or POS tag n-gram frequencies
  – A_DT few_JJ examples_NNS of_PREP heterologous_JJ expression_NN
• Noun phrase counts, verb phrase counts, length of noun phrases, length of verb phrases, etc.
  – NP[Another attempt] VP[to exploit] NP[syntactic information] VP[was proposed] PP[by Stamatatos et al. (2000)].
• Rewrite rule frequencies from the output of a syntactic parser
  – PP → PREP + NP
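Counting POS-tag n-grams is simple once a tagger has produced (word, tag) pairs. A minimal sketch over hypothetical pre-tagged input (the pairs and helper names below are illustrative, not the output of any particular tagger):

```python
from collections import Counter

# Hypothetical output of a POS tagger for a sentence like the one above
# (illustrative only; real systems would call an actual tagger here).
tagged = [("A", "DT"), ("few", "JJ"), ("examples", "NNS"),
          ("of", "PREP"), ("heterologous", "JJ"), ("expression", "NN")]

def pos_ngrams(tagged, n=2):
    """Frequencies of POS-tag n-grams, discarding the words themselves."""
    tags = [t for _, t in tagged]
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))

bigrams = pos_ngrams(tagged)
```

Because the words are discarded, such features are largely topic-independent, which is exactly what makes them attractive for style modeling.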
Semantic features
• The more detailed the text analysis required for extracting features, the less accurate the produced measures
  – Hence, there have been few attempts to exploit high-level features
• Examples of the use of semantic information:
  – Semantic relations (from dependency trees)
  – Synonyms and hypernyms of words (WordNet)
  – Semantic similarity between words, detected by means of LSI
Domain-specific features
• In some applications it is possible to use some structural measures to quantify the authorial style.
• Some examples are:
  – Use of greetings and farewells in the messages
  – Types of signatures
  – Use of indentation
  – Paragraph length
  – Font color counts and font size counts
Authorship attribution methods
• Instance-based approaches
  – Each training text is individually represented as a separate instance of authorial style
  – Use vector space representations and apply supervised learning algorithms, as in traditional text classification
• Profile-based approaches
  – Concatenate all the available training texts per author into one big file and extract a cumulative representation of that author's style (the profile) from this concatenated text
Which is better? What are the advantages and disadvantages?
Profile-based approaches (1)
• Training just comprises the extraction of profiles for the candidate authors.
• Attribution is based on the distance between the profile of an unseen text and the profile of each author.
• This can be realized using probabilistic or compression models.
Profile-based approaches (2)
• Probabilistic models: attempt to maximize the probability P(x|a) that a text x belongs to an author a
  – Can be applied to both character and word sequences
• Compression models: the difference in bit-wise size of the compressed files, d(x, xa) = C(xa + x) - C(xa), indicates the similarity of text x to author a
  – Several compression algorithms have been tested, including RAR, LZW, GZIP, BZIP2, and 7ZIP
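The compression model can be sketched in a few lines: a text is attributed to the author whose profile "absorbs" it most cheaply when compressed together. The sketch below uses zlib as a stand-in for the compressors listed above, on toy author profiles (all data and names are illustrative):

```python
import zlib

def compression_distance(profile, text):
    """d(x, x_a) = C(x_a + x) - C(x_a): the extra compressed bytes needed
    to encode text x after the author profile x_a. zlib stands in for the
    compressors mentioned on the slide. (Illustrative sketch.)"""
    c = lambda s: len(zlib.compress(s.encode("utf-8"), 9))
    return c(profile + text) - c(profile)

def attribute(text, profiles):
    """Assign the text to the candidate author at minimum distance."""
    return min(profiles, key=lambda a: compression_distance(profiles[a], text))

# Toy profiles: the concatenation of each author's training texts.
profiles = {
    "A": "the ship sailed at dawn " * 20,
    "B": "stock prices fell sharply today " * 20,
}
winner = attribute("the ship sailed at dusk", profiles)
```

The intuition: a compressor that has already seen an author's profile can encode stylistically similar new text with very few additional bytes.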
Comparison table
Additional issues
• The number of candidate authors
  – Increasing the number of authors leads to a significant decrease in performance
  – Character n-grams outperform other feature types; providing a more heterogeneous set of features improves the results significantly
• The size of the training set
  – AA can give reasonable results even when only limited data is available
  – Character n-grams are more robust to the effect of training-set size than syntactic or word-based features
Sentiment analysis
Bo Pang and Lillian Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, Vol. 2, No. 1-2 (2008), 1-135.
New classification alternatives
• Growing interest in non-topical text analysis
  – Analysis of the opinions, feelings, and attitudes expressed in a text, rather than just the facts
• Web resources such as discussion forums, review sites, and blogs are a great source of information:
  – Many people guide their decisions by the opinions that other consumers have publicly expressed
  – Analysts from government, commercial, and political domains require tools to automatically track attitudes and feelings in the news and in on-line forums
Applications?
Applications (1)
• As a sub-component technology
  – Recommendation systems: penalize items that receive a lot of negative feedback
  – Information extraction: discard information found in subjective sentences
  – Question answering: handle opinion-oriented questions
  – Summarization: consider multiple viewpoints
  – Citation analysis: determine whether an author cites a piece of work as supporting evidence or as research that he or she dismisses
Applications (2)
• In business and government intelligence
  – Product quality: classify products based on their reviews, e.g., for future recommendation or to stop production
  – Product analysis: identify the product features that customers have expressed opinions on, e.g., to change the design or to use in publicity
  – Analysis of political debates: find speeches that represent support of or opposition to a given proposal
  – Reputation analysis: identify good and bad opinions about public personalities (e.g., politicians)
Two main tasks
• Subjectivity classification/detection
  – Distinguish sentences used to present opinions and other forms of subjectivity from sentences used to objectively present factual information
• Sentiment classification
  – Classify the opinion as falling under one of two opposing sentiment polarities (positive or negative), or locate its position on the continuum between these two polarities
How to carry out these tasks? Which features could be useful?
Main features for sentiment analysis (1)
• Bag of words
  – Better results using Boolean weights than tf-idf
    • Word presence is enough, since sentiment is not usually highlighted through repeated use of the same terms
• Lexical features beyond single words
  – The position of a token within a textual unit can potentially have important effects on how much that token affects the overall sentiment or subjectivity status of the enclosing textual unit
  – Word n-grams; their usefulness appears to be a matter of some debate
Main features for sentiment analysis (2)
• Part-of-speech (POS) tags
  – The idea is to capture the presence (or polarity) of (certain) adjectives and adverbs
  – Other parts of speech also contribute to expressing sentiment (nouns: gem; verbs: love)
• Syntactic features
  – Collocations and syntactic patterns have been found useful for subjectivity detection, for example:
    • <subj> was satisfied; to condemn <dobj>
Supervised sentiment classification
• Uses labeled document sets
• Considers all the features described above
  – Best results using lexical features
  – Robust results with binary weights
• Applies standard text-categorization algorithms
  – Best reported results using SVM and Naïve Bayes
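As a concrete illustration of supervised sentiment classification with Boolean (word-presence) weights, here is a minimal binarized Naïve Bayes sketch in plain Python. The toy training data and all names are illustrative; a real system would use a library implementation and far more data:

```python
import math
import re
from collections import defaultdict

def tokenize(text):
    # A set, not a multiset: Boolean word-presence weights, per the slide.
    return set(re.findall(r"[a-z']+", text.lower()))

def train_nb(docs):
    """Train a binarized Naive Bayes on (text, label) pairs with add-one
    smoothing. (Minimal sketch; toy data below is illustrative.)"""
    label_docs = defaultdict(list)
    for text, label in docs:
        label_docs[label].append(tokenize(text))
    vocab = set().union(*(t for ts in label_docs.values() for t in ts))
    model = {}
    for label, toksets in label_docs.items():
        n = len(toksets)
        prior = math.log(n / len(docs))
        word_lp = {w: math.log((sum(w in t for t in toksets) + 1) / (n + 2))
                   for w in vocab}
        model[label] = (prior, word_lp)
    return model, vocab

def classify(model, vocab, text):
    toks = tokenize(text) & vocab   # ignore unseen words
    return max(model, key=lambda lb: model[lb][0] +
               sum(model[lb][1][w] for w in toks))

train = [("a wonderful and moving film", "pos"),
         ("great acting, loved it", "pos"),
         ("boring plot and terrible acting", "neg"),
         ("a dull, terrible waste of time", "neg")]
model, vocab = train_nb(train)
pred = classify(model, vocab, "a wonderful film, great acting")
```

Swapping in an SVM over the same Boolean feature vectors would follow the same pipeline; only the learning algorithm changes.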
How to do the classification without a training set?
Unsupervised sentiment classification
• Idea: it is not hard to identify sentiment words and their orientation
• The algorithm of Turney (2002):
  1. Select phrases containing adjectives or adverbs
  2. Extract pairs of words matching patterns such as ADJ NOUN or NOUN NOUN
  3. Estimate the semantic orientation of the extracted phrases using the PMI-IR algorithm (against some seed words, e.g., "awful" and "excellent")
  4. Compute the average semantic orientation of the phrases in the given review; classify the review as recommended if the average is positive, and as not recommended otherwise
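Steps 3 and 4 above can be sketched as follows. PMI-IR estimates the probabilities from web search hit counts; the toy counts below stand in for those hits (all data is purely illustrative), using the seed words mentioned on the slide:

```python
import math

# Toy stand-in for PMI-IR's web hit counts (illustrative only):
# hits[seed] = total hits for a seed word; near[(phrase, seed)] = hits
# where the phrase occurs NEAR the seed word.
hits = {"excellent": 1000, "awful": 1000}
near = {("low fidelity", "excellent"): 2, ("low fidelity", "awful"): 30,
        ("direct deposit", "excellent"): 25, ("direct deposit", "awful"): 3}

def so(phrase):
    """Semantic orientation a la Turney (2002):
    SO = log2[near(p, excellent) * hits(awful) / (near(p, awful) * hits(excellent))]
    with add-one smoothing to avoid zero counts."""
    return math.log2(((near.get((phrase, "excellent"), 0) + 1) * hits["awful"]) /
                     ((near.get((phrase, "awful"), 0) + 1) * hits["excellent"]))

def classify_review(phrases):
    """Step 4: average the SO of the review's phrases; positive => recommended."""
    avg = sum(so(p) for p in phrases) / len(phrases)
    return "recommended" if avg > 0 else "not recommended"
```

Phrases that co-occur more often with the positive seed get a positive orientation, and vice versa; the review's label is just the sign of the average.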