Special Topics in Text Mining
Manuel Montes y Gómez (http://ccc.inaoep.mx/~mmontesg/)
University of Alabama at Birmingham, Spring 2011
Non-thematic text classification
Special Topics on Information Retrieval
Agenda
• Authorship attribution
  – Description of the task
  – Applications and related tasks
  – Features and methods
• Sentiment classification
  – Description of the task
  – Applications
  – Features and methods
  – Cross-domain sentiment classification
Authorship attribution
Stamatatos, E. 2009. A Survey of Modern Authorship Attribution Methods. Journal of the American Society for Information Science and Technology, 60(3): 538-556.
AA as a classification problem
• In the typical authorship attribution problem, a text of unknown authorship is assigned to one candidate author, given a set of candidate authors for whom text samples of undisputed authorship are available.
• From a machine learning point of view, this can be viewed as a multi-class, single-label text categorization task.
Possible applications of AA? Other related tasks?
At the heart of AA
• Information retrieval methods that make it possible to represent and classify large volumes of text.
• Machine learning algorithms that help handle high-dimensional, sparse data, allowing more expressive representations.
• NLP techniques that analyze text efficiently and provide new measures for representing style (e.g., syntax-based features).
Application areas of AA
• Intelligence
  – Attribution of messages or proclamations to known terrorists
  – Linking different messages by authorship
• Criminal law
  – Identifying writers of harassing messages
  – Verifying the authenticity of suicide notes
• Civil law
  – Copyright disputes
• Computer forensics
  – Identifying the authors of the source code of malicious software
• Literary research
  – Attributing anonymous or disputed literary works to known authors
Related tasks
• Author verification
  – Deciding whether or not a given text was written by a certain author
• Plagiarism detection
  – Finding similarities between two texts
• Author profiling or characterization
  – Extracting information about the age, education, sex, etc. of the author of a given text
• Detection of stylistic inconsistencies
  – Analyzing collaborative writing
  – Detecting plagiarism (intrinsic plagiarism detection)
Features and methods
• As mentioned, the main idea behind AA is that by measuring some textual features we can distinguish between texts written by different authors.
• It is important to have features that quantify the writing style of authors, and to apply methods able to learn from such features.
How to address the AA problem? What features could be used?
Lexical features (1)
• Several different lexical features have been used for AA:
  – Simple measures such as sentence-length and word-length counts
    • Can be applied to any language and any corpus
    • For certain languages (Chinese, German, etc.) word segmentation is not trivial
  – Vocabulary richness and the number of hapax legomena (i.e., words occurring only once)
    • Vocabulary size depends heavily on text length
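These simple lexical measures are easy to compute directly from raw text. The sketch below (plain Python, naive sentence splitting and tokenization; all names are illustrative) derives sentence length, word length, vocabulary richness, and the hapax legomena count:

```python
import re
from collections import Counter

def lexical_measures(text):
    """Simple lexical style measures: average sentence length, average
    word length, type-token ratio, and hapax legomena count.
    (Illustrative sketch: naive sentence split, regex tokenization.)"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-zA-Z']+", text.lower())
    counts = Counter(words)
    return {
        "avg_sentence_len": len(words) / len(sentences),       # words per sentence
        "avg_word_len": sum(map(len, words)) / len(words),     # characters per word
        "type_token_ratio": len(counts) / len(words),          # vocabulary richness
        "hapax_legomena": sum(1 for c in counts.values() if c == 1),
    }

m = lexical_measures("The cat sat. The cat slept soundly.")
```

Note the caveat from the slide: the type-token ratio and hapax count both shrink as texts get longer, so such measures should only be compared across texts of similar length.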
Lexical features (2)
• Traditional bag-of-words text representation
  – Good for topic classification, but does not necessarily capture the writing style of authors
• Function words
  – Used in a largely unconscious manner by authors, and topic-independent
• Subset of the most frequent words
  – Similar problems to bag-of-words
• Word n-grams
  – Not always better than individual word features
  – Dimensionality increases considerably
Character features
• According to this family of measures, a text is viewed as a mere sequence of characters.
• Various character-level measures:
  – Counts of alphabetic characters, digits, uppercase and lowercase characters, letter frequencies, punctuation marks, etc.
• Frequencies of character n-grams
  – Lexical information (e.g., |_in_|, |text|)
  – Contextual information (e.g., |in_t|)
  – Use of punctuation and capitalization
  – Commonly used suffixes (e.g., |ful_|, |ing_|)
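Character n-gram frequencies are straightforward to extract. A minimal sketch, marking spaces with "_" as in the examples above (function name and parameters are illustrative):

```python
from collections import Counter

def char_ngrams(text, n=4):
    """Frequencies of character n-grams; '_' stands for a space,
    following the slide's notation. (Illustrative sketch.)"""
    s = text.replace(" ", "_")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

grams = char_ngrams("in text mining", n=4)
# grams now contains entries such as |in_t| (contextual) and |text| (lexical)
```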
Character features – some issues
• Extracting n-grams is language-independent and requires no special tools.
• Dimensionality is considerably increased in comparison to the word-based representation.
• How long should the n-grams be?
  – A large n better captures lexical and contextual information, but it also captures thematic information and increases dimensionality
  – A small n can represent sub-word (syllable-like) information, but is not adequate for representing contextual information
Syntactic features (1)
• The idea is that authors tend to use similar syntactic patterns unconsciously.
• Syntactic information is considered a more reliable authorial fingerprint than lexical information.
• Disadvantages:
  – Robust and accurate NLP tools are required to perform syntactic analysis of texts
  – The procedure is language-dependent
Syntactic features (2)
• POS tag frequencies or POS tag n-gram frequencies
  – A_DT few_JJ examples_NNS of_PREP heterologous_JJ expression_NN
• Noun phrase counts, verb phrase counts, length of noun phrases, length of verb phrases, etc.
  – NP[Another attempt] VP[to exploit] NP[syntactic information] VP[was proposed] PP[by Stamatatos et al. (2000)].
• Rewrite rule frequencies from the output of a syntactic parser
  – PP → PREP + NP
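Counting POS-tag n-grams is simple once a tagger has produced (word, tag) pairs. A minimal sketch over hypothetical pre-tagged input (the pairs and helper names below are illustrative, not the output of any particular tagger):

```python
from collections import Counter

# Hypothetical output of a POS tagger for a sentence like the one above
# (illustrative only; real systems would call an actual tagger here).
tagged = [("A", "DT"), ("few", "JJ"), ("examples", "NNS"),
          ("of", "PREP"), ("heterologous", "JJ"), ("expression", "NN")]

def pos_ngrams(tagged, n=2):
    """Frequencies of POS-tag n-grams, discarding the words themselves."""
    tags = [t for _, t in tagged]
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))

bigrams = pos_ngrams(tagged)
```

Because the words are discarded, such features are largely topic-independent, which is exactly what makes them attractive for style modeling.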
Semantic features
• The more detailed the text analysis required for extracting features, the less accurate the produced measures
  – Hence, there have been few attempts to exploit high-level features
• Examples of the use of semantic information:
  – Semantic relations (from dependency trees)
  – Synonyms and hypernyms of words (WordNet)
  – Semantic similarity between words, detected by means of LSI
Domain-specific features
• In some applications it is possible to use some structural measures to quantify the authorial style.
• Some examples are:
  – Use of greetings and farewells in the messages
  – Types of signatures
  – Use of indentation
  – Paragraph length
  – Font color counts and font size counts
Authorship attribution methods
• Instance-based approaches
  – Each training text is individually represented as a separate instance of authorial style
  – Use vector space representations and apply supervised learning algorithms, as in traditional text classification
• Profile-based approaches
  – Concatenate all the available training texts per author into one big file and extract a cumulative representation of that author's style (the profile) from this concatenated text
Which is better? What are the advantages and disadvantages?
Profile-based approaches (1)
• Training just comprises the extraction of profiles for the candidate authors.
• Attribution is based on the distance between the profile of an unseen text and the profile of each author.
• This can be realized using probabilistic or compression models.
Profile-based approaches (2)
• Probabilistic models: attempt to maximize the probability P(x|a) that a text x belongs to an author a
  – Can be applied to both character and word sequences
• Compression models: the difference in bit-wise size of the compressed files, d(x, xa) = C(xa + x) - C(xa), indicates the similarity of text x to author a
  – Several compression algorithms have been tested, including RAR, LZW, GZIP, BZIP2, and 7ZIP
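The compression model can be sketched in a few lines: a text is attributed to the author whose profile "absorbs" it most cheaply when compressed together. The sketch below uses zlib as a stand-in for the compressors listed above, on toy author profiles (all data and names are illustrative):

```python
import zlib

def compression_distance(profile, text):
    """d(x, x_a) = C(x_a + x) - C(x_a): the extra compressed bytes needed
    to encode text x after the author profile x_a. zlib stands in for the
    compressors mentioned on the slide. (Illustrative sketch.)"""
    c = lambda s: len(zlib.compress(s.encode("utf-8"), 9))
    return c(profile + text) - c(profile)

def attribute(text, profiles):
    """Assign the text to the candidate author at minimum distance."""
    return min(profiles, key=lambda a: compression_distance(profiles[a], text))

# Toy profiles: the concatenation of each author's training texts.
profiles = {
    "A": "the ship sailed at dawn " * 20,
    "B": "stock prices fell sharply today " * 20,
}
winner = attribute("the ship sailed at dusk", profiles)
```

The intuition: a compressor that has already seen an author's profile can encode stylistically similar new text with very few additional bytes.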
Comparison table
Additional issues
• The number of candidate authors
  – Increasing the number of authors leads to a significant decrease in performance
  – Character n-grams outperform other feature types; providing a more heterogeneous set of features improves the results significantly
• The size of the training set
  – AA can give reasonable results even when only limited data is available
  – Character n-grams are more robust to the effect of training-set size than syntactic or word-based features
Sentiment analysis
Bo Pang and Lillian Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, Vol. 2, No. 1-2 (2008), 1-135.
New classification alternatives
• Growing interest in non-topical text analysis
  – Analysis of the opinions, feelings, and attitudes expressed in a text, rather than just the facts
• Web resources such as discussion forums, review sites, and blogs are a great source of information:
  – Many people guide their decisions by the opinions that other consumers have publicly expressed
  – Analysts from government, commercial, and political domains require tools to automatically track attitudes and feelings in the news and in on-line forums
Applications?
Applications (1)
• As a sub-component technology
  – Recommendation systems: penalize items that receive a lot of negative feedback
  – Information extraction: discard information found in subjective sentences
  – Question answering: handle opinion-oriented questions
  – Summarization: consider multiple viewpoints
  – Citation analysis: determine whether an author cites a piece of work as supporting evidence or as research that he or she dismisses
Applications (2)
• In business and government intelligence
  – Product quality: classify products based on their reviews, e.g., for future recommendation or to stop production
  – Product analysis: identify the product features that customers have expressed opinions on, e.g., to change the design or to use in publicity
  – Analysis of political debates: find speeches that represent support of or opposition to a given proposal
  – Reputation analysis: identify good and bad opinions about public personalities (e.g., politicians)
Two main tasks
• Subjectivity classification/detection
  – Distinguish sentences used to present opinions and other forms of subjectivity from sentences used to objectively present factual information
• Sentiment classification
  – Classify the opinion as falling under one of two opposing sentiment polarities (positive or negative), or locate its position on the continuum between these two polarities
How to carry out these tasks? Which features could be useful?
Main features for sentiment analysis (1)
• Bag of words
  – Better results using Boolean weights than tf-idf
    • Word presence is enough, since sentiment is not usually highlighted through repeated use of the same terms
• Lexical features beyond single words
  – The position of a token within a textual unit can potentially have important effects on how much that token affects the overall sentiment or subjectivity status of the enclosing textual unit
  – Word n-grams; their usefulness appears to be a matter of some debate
Main features for sentiment analysis (2)
• Part-of-speech (POS) tags
  – The idea is to capture the presence (or polarity) of (certain) adjectives and adverbs
  – Other parts of speech also contribute to expressing sentiment (nouns: gem; verbs: love)
• Syntactic features
  – Collocations and syntactic patterns have been found useful for subjectivity detection, for example:
    • <subj> was satisfied; to condemn <dobj>
Supervised sentiment classification
• Uses labeled document sets
• Considers all the features described above
  – Best results using lexical features
  – Robust results with binary weights
• Applies standard text-categorization algorithms
  – Best reported results using SVM and Naïve Bayes
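As a concrete illustration of supervised sentiment classification with Boolean (word-presence) weights, here is a minimal binarized Naïve Bayes sketch in plain Python. The toy training data and all names are illustrative; a real system would use a library implementation and far more data:

```python
import math
import re
from collections import defaultdict

def tokenize(text):
    # A set, not a multiset: Boolean word-presence weights, per the slide.
    return set(re.findall(r"[a-z']+", text.lower()))

def train_nb(docs):
    """Train a binarized Naive Bayes on (text, label) pairs with add-one
    smoothing. (Minimal sketch; toy data below is illustrative.)"""
    label_docs = defaultdict(list)
    for text, label in docs:
        label_docs[label].append(tokenize(text))
    vocab = set().union(*(t for ts in label_docs.values() for t in ts))
    model = {}
    for label, toksets in label_docs.items():
        n = len(toksets)
        prior = math.log(n / len(docs))
        word_lp = {w: math.log((sum(w in t for t in toksets) + 1) / (n + 2))
                   for w in vocab}
        model[label] = (prior, word_lp)
    return model, vocab

def classify(model, vocab, text):
    toks = tokenize(text) & vocab   # ignore unseen words
    return max(model, key=lambda lb: model[lb][0] +
               sum(model[lb][1][w] for w in toks))

train = [("a wonderful and moving film", "pos"),
         ("great acting, loved it", "pos"),
         ("boring plot and terrible acting", "neg"),
         ("a dull, terrible waste of time", "neg")]
model, vocab = train_nb(train)
pred = classify(model, vocab, "a wonderful film, great acting")
```

Swapping in an SVM over the same Boolean feature vectors would follow the same pipeline; only the learning algorithm changes.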
How to do the classification without a training set?
Unsupervised sentiment classification
• Idea: it is not hard to identify sentiment words and their orientation
• The algorithm of Turney (2002):
  1. Select phrases containing adjectives or adverbs
  2. Extract pairs of words matching patterns such as ADJ NOUN or NOUN NOUN
  3. Estimate the semantic orientation of the extracted phrases using the PMI-IR algorithm (against some seed words, e.g., "awful" and "excellent")
  4. Compute the average semantic orientation of the phrases in the given review; classify the review as recommended if the average is positive, and as not recommended otherwise
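Steps 3 and 4 above can be sketched as follows. PMI-IR estimates the probabilities from web search hit counts; the toy counts below stand in for those hits (all data is purely illustrative), using the seed words mentioned on the slide:

```python
import math

# Toy stand-in for PMI-IR's web hit counts (illustrative only):
# hits[seed] = total hits for a seed word; near[(phrase, seed)] = hits
# where the phrase occurs NEAR the seed word.
hits = {"excellent": 1000, "awful": 1000}
near = {("low fidelity", "excellent"): 2, ("low fidelity", "awful"): 30,
        ("direct deposit", "excellent"): 25, ("direct deposit", "awful"): 3}

def so(phrase):
    """Semantic orientation a la Turney (2002):
    SO = log2[near(p, excellent) * hits(awful) / (near(p, awful) * hits(excellent))]
    with add-one smoothing to avoid zero counts."""
    return math.log2(((near.get((phrase, "excellent"), 0) + 1) * hits["awful"]) /
                     ((near.get((phrase, "awful"), 0) + 1) * hits["excellent"]))

def classify_review(phrases):
    """Step 4: average the SO of the review's phrases; positive => recommended."""
    avg = sum(so(p) for p in phrases) / len(phrases)
    return "recommended" if avg > 0 else "not recommended"
```

Phrases that co-occur more often with the positive seed get a positive orientation, and vice versa; the review's label is just the sign of the average.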