A Bootstrapping Method for Building Subjectivity Lexicons for Languages with Scarce Resources
Carmen Banea, Rada Mihalcea
University of North Texas
Janyce Wiebe
University of Pittsburgh
Subjectivity analysis
Subjectivity analysis (opinions and sentiments)
Used in a wide variety of applications
  Tracking sentiment timelines in news (Lloyd et al., 2005)
  Review classification (Turney, 2002; Pang et al., 2002)
  Mining opinions from product reviews (Hu and Liu, 2004)
  Expressive text-to-speech synthesis (Alm et al., 2005)
  Text semantic analysis (Wiebe and Mihalcea, 2006; Esuli and Sebastiani, 2006)
  Question answering (Yu and Hatzivassiloglou, 2003)
Much work on subjectivity analysis has focused on English
  Other languages: Japanese (Takamura et al., 2006), Chinese (Hu et al., 2005), German (Kim and Hovy, 2006)
Proportion of Languages on the Web
[Chart: proportion of languages on the Web. Source: internetworldstats.com, updated November 30, 2007]
Objective
Develop a method for subjectivity analysis that
  Requires few electronic resources
  Can be easily ported to a new language
Applicable to the large number of languages that have scarce electronic resources
Related Work
Tools that rely on manually or semi-automatically constructed lexicons
  Yu and Hatzivassiloglou, 2003; Riloff and Wiebe, 2003; Kim and Hovy, 2006
  Enable efficient rule-based subjectivity and sentiment classifiers that rely on the presence of lexicon entries in text
These tools assume the availability of advanced language processing tools:
  Syntactic parsers (Wiebe, 2000), information extraction (Riloff and Wiebe, 2003)
  Broad-coverage, rich lexical resources such as WordNet (Esuli and Sebastiani, 2006)
Our approach relates most closely to the method of Turney (2002) for the construction of lexicons annotated for polarity
  We address the task of acquiring a subjectivity lexicon
  We rely on fewer, smaller-scale resources
Our Method
Based on bootstrapping
Requires:
  A small seed set of subjective entries
  One or more electronic dictionaries
  A small training corpus (approx. 500,000 words)
Experiments focused on Romanian
  Applicable to other languages as well
Bootstrapping Process
[Flowchart: seeds -> query online dictionary -> candidate synonyms -> fixed filtering -> selected synonyms -> "Max. no. of iterations?"; no: selected synonyms are used as seeds for the next iteration; yes: variable filtering]
Seed Set
Category    Sample entries (with their English translation)
Noun        blestem (curse), despot (tyrant), furie (fury), idiot (idiot), fericire (happiness)
Verb        iubi (love), aprecia (appreciate), spera (hope), dori (wish), uri (hate)
Adjective   frumos (beautiful), dulce (sweet), urat (ugly), fericit (happy), fascinant (fascinating)
Adverb      posibil (possibly), probabil (probably), desigur (of course), enervant (unnerving)
60 seeds, evenly sampled from verbs, nouns, adjectives and adverbs
Manually selected
Seed sources:
  11th-grade curriculum for Romanian Language and Literature
  Translations of instances appearing in the OpinionFinder strong subjective lexicon (Wiebe and Riloff, 2005)
Expansion
Romanian dictionary: http://www.dexonline.ro
Dictionaries for other languages are also available, or can be obtained from paper dictionaries through OCR
Candidate synonyms: all open-class words longer than 3 letters that have a definition in the dictionary
Diacritics are removed
[Diagram labels: Seed, Definition, Candidate synonyms]
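The candidate selection rules on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes candidates are read from the text of a dictionary entry, and the names `strip_diacritics`, `extract_candidates`, `known_words` and the tiny `CLOSED_CLASS` list are hypothetical stand-ins (a real system would need a proper way to identify open-class words).

```python
import re
import unicodedata

# Hypothetical, very small list of closed-class words (assumption for illustration).
CLOSED_CLASS = {"care", "este", "sunt", "pentru", "prin", "foarte"}

def strip_diacritics(text):
    """Drop combining marks after NFD decomposition (e.g. 'plăcut' -> 'placut')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def extract_candidates(entry_text, known_words):
    """Return words longer than 3 letters that have a dictionary entry
    (known_words) and are not in the closed-class list, in order of appearance."""
    words = re.findall(r"[^\W\d_]+", strip_diacritics(entry_text.lower()))
    seen, result = set(), []
    for word in words:
        if len(word) > 3 and word in known_words and word not in CLOSED_CLASS:
            if word not in seen:
                seen.add(word)
                result.append(word)
    return result

# Toy usage with invented data.
known = {"gust", "placut", "dulceag", "care"}
print(extract_candidates("Care are gust plăcut, dulceag", known))
```

Removing diacritics before the lookup means the dictionary keys (`known_words` here) also have to be stored without diacritics.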
Filtering
Candidates are filtered based on a measure of similarity with the original seeds
We use Latent Semantic Analysis (LSA) (Dumais et al., 1988) trained on the SemCor corpus (Miller et al., 1993)
After each iteration, only candidates with an LSA score higher than a given threshold are selected for further expansion
Example:
  Seed: dulce (sweet)
  Candidate synonyms: cu gust dulce (sweet-tasting), placut (pleasant), dulceag (quasi-sweet)
Filtering
Several iterations of the bootstrapping process result in a subjectivity lexicon consisting of a ranked list of candidates in decreasing order of similarity to the original seeds
A variable filtering threshold can be used to further restrict the similarity, yielding a purer lexicon
Filtering parameters:
  Similarity threshold
  Number of iterations
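The bootstrapping loop with its two filtering parameters might be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' implementation: `lookup` stands in for the online dictionary query, `similarity` stands in for the LSA score against the original seeds, seeds are given a score of 1.0 by assumption, and the toy dictionary and scores are invented.

```python
def bootstrap(seeds, lookup, similarity, threshold=0.5, max_iterations=5):
    """Expand a seed set, keeping candidates whose score exceeds the threshold.

    lookup(word)     -> iterable of candidate words (hypothetical dictionary query)
    similarity(word) -> score against the original seeds (hypothetical LSA stand-in)
    Returns (word, score) pairs ranked by decreasing score.
    """
    lexicon = {seed: 1.0 for seed in seeds}  # assumption: seeds get the top score
    frontier = list(seeds)
    for _ in range(max_iterations):
        selected = []
        for word in frontier:
            for candidate in lookup(word):
                if candidate in lexicon:
                    continue
                score = similarity(candidate)
                if score > threshold:  # fixed filtering after each expansion
                    lexicon[candidate] = score
                    selected.append(candidate)
        if not selected:  # nothing new to expand
            break
        frontier = selected
    return sorted(lexicon.items(), key=lambda item: (-item[1], item[0]))

def variable_filter(ranked, threshold):
    """Apply a stricter threshold to the ranked list (variable filtering)."""
    return [(word, score) for word, score in ranked if score > threshold]

# Toy data, invented for illustration only.
TOY_DICTIONARY = {"dulce": ["placut", "dulceag", "zahar"], "placut": ["agreabil"]}
TOY_SCORES = {"placut": 0.8, "dulceag": 0.7, "zahar": 0.2, "agreabil": 0.6}

ranked = bootstrap(["dulce"],
                   lambda word: TOY_DICTIONARY.get(word, []),
                   lambda word: TOY_SCORES.get(word, 0.0))
print(ranked)
print(variable_filter(ranked, 0.65))
```

Keeping the fixed threshold inside the loop and the variable threshold outside it mirrors the two filtering steps named on the slides: the first limits what is expanded further, the second trims the final ranked list.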
Lexicon Acquisition
Evaluation
Rule-based classifier of subjectivity (Riloff and Wiebe, 2003)
  Subjective sentence: three or more subjective entries
  Objective sentence: two or fewer subjective entries
Gold standard data set (Mihalcea, Banea and Wiebe, 2007)
  504 sentences from five SemCor documents (manually translated into Romanian)
  Labeled by two annotators
  Agreement (all): 83% (κ = 0.67)
  Agreement (uncertain removed): 89% (κ = 0.77)
  Baseline: 54% (all subjective)
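The rule-based classifier described on the Evaluation slide reduces to counting lexicon entries in a sentence. The sketch below assumes pre-tokenized input with diacritics already removed, counts every matching token (whether repeated entries count once or several times is not stated on the slide), and uses a hypothetical function name and toy lexicon.

```python
def classify_sentence(tokens, lexicon, min_entries=3):
    """Label a sentence 'subjective' if it contains at least min_entries
    tokens found in the lexicon, otherwise 'objective'."""
    hits = sum(1 for token in tokens if token in lexicon)
    return "subjective" if hits >= min_entries else "objective"

# Toy lexicon and sentences, invented for illustration.
toy_lexicon = {"frumos", "dulce", "fascinant", "urat"}
print(classify_sentence(["un", "film", "frumos", "dulce", "fascinant"], toy_lexicon))
print(classify_sentence(["un", "film", "frumos", "dulce"], toy_lexicon))
```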
Number of Iterations
F-measure for the bootstrapping subjectivity lexicon over 5 iterations and an LSA threshold of 0.5
Similarity Threshold
F-measure for the fifth bootstrapping iteration for varying LSA scores
Comparison
Bootstrapping rule-based classifier: uses a 3,913-entry subjectivity lexicon obtained through 5 iterations and a similarity threshold of 0.5
Conclusions
Our bootstrapping method uses few electronic resources:
  A small seed set
  One or more dictionaries
  A small corpus of half a million words
A large subjectivity lexicon of approx. 4,000 entries was extracted
Using an unsupervised rule-based classifier, a subjectivity F-measure of 66.20% and an overall F-measure of 61.69% can be achieved
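For readers unfamiliar with the metric, a per-class F-measure can be computed from gold and predicted labels as sketched below. This shows the standard definition (harmonic mean of precision and recall); how the overall figure above was aggregated is not stated on the slides, and the function name and toy labels are invented for illustration.

```python
def f_measure(gold, predicted, label):
    """F-measure (harmonic mean of precision and recall) for one class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy labels (S = subjective, O = objective), not data from the study.
toy_gold = ["S", "S", "S", "O", "O"]
toy_pred = ["S", "S", "O", "S", "O"]
print(f_measure(toy_gold, toy_pred, "S"))
print(f_measure(toy_gold, toy_pred, "O"))
```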
Questions?