
Linguistic component Sentiment Analyzer for the Russian language


Sentiment analyzer for processing generic texts as well as tweets in Russian. It assigns each text to one of three classes {NEGATIVE, NEUTRAL, POSITIVE} and detects subjectivity / objectivity. Both modes can be run with or without keywords describing a target object (for example, a brand name).


Linguistic Component: Sentiment Analyzer for the Russian language

Technical description
SemanticAnalyzer Group, 2013-08-30
www.semanticanalyzer.info

This document describes the technical details of the sentiment analyzer for the Russian language. The component has several modes of operation:

Processing of generic texts: news, technical articles, etc.

Processing of Twitter messages

Processing of the above two types of text for generic background sentiment

Processing of the above two types of text for a set of multi-word synonyms representing a target object

The sentiment analyzer is based on two other linguistic components: a tokenizer and a lemmatizer (see their respective technical descriptions). Besides assigning an input message to one of three classes {NEGATIVE, NEUTRAL, POSITIVE}, the analyzer can also determine whether the message is objective or subjective.

Demo package sent upon request contains the following:

Java library of the sentiment analyzer in binary form

Polarity dictionaries

run_sentiment_engine.sh script for quickly checking the functionality of the module

messages_to_detect_sentiment.txt file containing examples of generic text and tweets for sentiment attribution using the run_sentiment_engine.sh script

The algorithm is based on a set of rules that compactly model the flow of sentiment within an input message. Synonym matching can be strict or fuzzy (accommodating misspellings of an object name in a text).
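This description does not say how fuzzy matching is implemented. Purely as an illustration, one common way to tolerate misspellings of an object name is to accept a token whose Levenshtein edit distance to a synonym stays within a small threshold. The class name FuzzySynonymMatcher, the threshold parameter and the case folding below are assumptions made for this sketch, not the component's actual algorithm or API.

```java
import java.util.Locale;

/** Hypothetical sketch of fuzzy synonym matching via edit distance (not the library's code). */
final class FuzzySynonymMatcher {

    /** Classic Levenshtein distance computed with two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** True when token is within maxDistance edits of synonym, ignoring case. */
    static boolean matches(String token, String synonym, int maxDistance) {
        String t = token.toLowerCase(Locale.ROOT);
        String s = synonym.toLowerCase(Locale.ROOT);
        return editDistance(t, s) <= maxDistance;
    }
}
```

With a threshold of 0 this reduces to strict (case-insensitive) matching; with a threshold of 1, a misspelling such as "macafe" would still match the synonym "maccafe".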

Speed of processing

Server: Intel(R) Xeon(R) CPU X3363 @ 2.83GHz
Operating system: Ubuntu 10.04, Java 1.7.0_21 64-bit server
480 characters/ms
70 tokens/ms

Tests were conducted in a single thread on 63 511 tweet messages with 2 527 227 words and 17 350 258 characters. Total time of execution: 36170 ms.
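The throughput figures follow directly from these totals. The short sketch below recomputes them; the class name ThroughputCheck is hypothetical, and it assumes that the reported word count is what the "tokens/ms" figure is based on.

```java
/** Recomputes the reported throughput from the raw test totals (illustrative only). */
final class ThroughputCheck {

    /** Returns the number of units processed per millisecond. */
    static double perMillisecond(long units, long elapsedMs) {
        if (elapsedMs <= 0) {
            throw new IllegalArgumentException("elapsedMs must be positive");
        }
        return (double) units / elapsedMs;
    }

    public static void main(String[] args) {
        long characters = 17_350_258L; // total characters in the test set
        long words = 2_527_227L;       // total words, treated here as tokens (assumption)
        long elapsedMs = 36_170L;      // total execution time in milliseconds

        System.out.println(Math.round(perMillisecond(characters, elapsedMs)) + " characters/ms");
        System.out.println(Math.round(perMillisecond(words, elapsedMs)) + " tokens/ms");
    }
}
```

Rounded to whole units, the results agree with the 480 characters/ms and 70 tokens/ms quoted above.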

Format of the messages_to_detect_sentiment.txt file

This file describes input data for the sentiment analyzer for demo purposes.

Format:

Text
OR
Text\tKeyword comma separated list

Text contains textual data in Russian for detecting sentiment
\t is the tab symbol
Keyword comma separated list is a list of object synonyms to detect sentiment against.
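A line in this format can be split with a few lines of Java. The sketch below is not part of the demo package: the class DemoInputLine and its fields are hypothetical, and it assumes that only the first tab separates the text from the keywords, that whitespace around keywords is insignificant, and that empty keywords are skipped.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical holder for one line of messages_to_detect_sentiment.txt. */
final class DemoInputLine {
    final String text;
    final List<String> keywords;

    private DemoInputLine(String text, List<String> keywords) {
        this.text = text;
        this.keywords = Collections.unmodifiableList(keywords);
    }

    /** Parses "Text" or "Text\tkeyword1,keyword2,...". */
    static DemoInputLine parse(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            // No keyword list: the line is analyzed for generic background sentiment.
            return new DemoInputLine(line.trim(), new ArrayList<String>());
        }
        String text = line.substring(0, tab).trim();
        List<String> keywords = new ArrayList<String>();
        for (String keyword : line.substring(tab + 1).split(",")) {
            String trimmed = keyword.trim();
            if (!trimmed.isEmpty()) {
                keywords.add(trimmed);
            }
        }
        return new DemoInputLine(text, keywords);
    }
}
```

A line without a tab yields an empty keyword list, which corresponds to the mode without a target object.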

Examples of detecting sentiment

The run_sentiment_engine.sh script will generate the following file: messages_to_detect_sentiment.out.

For the following input file messages_to_detect_sentiment.txt:

Мне понравился новый iPhone, но вот GalaxyS неудобный. iPhone

(sentence: "I liked the new iPhone, but the GalaxyS is awkward to use", with the object described by the keyword "iPhone")

this output is generated:

Мне понравился новый iPhone, но вот GalaxyS неудобный. iPhone [iphone] POSITIVE

For the following input file messages_to_detect_sentiment.txt:

Мне понравился новый iPhone, но вот GalaxyS неудобный. GalaxyS

(the same sentence, but with the object described by the keyword "GalaxyS")

this output is generated:

Мне понравился новый iPhone, но вот GalaxyS неудобный. GalaxyS [galaxys] NEGATIVE

Examples of using the library from the Java code

import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Assumption: JUnit 4 is on the classpath; the source does not show the enclosing test class.
import static org.junit.Assert.assertEquals;

// SentimentEngine, SynonymSentiment and Enumerations come from the sentiment analyzer library.

public void testdetectPolarityOfText() throws Exception {
    SentimentEngine sentimentEngine = new SentimentEngine(
            new File("conf/sentiment-module.properties"));
    sentimentEngine.setVerbose(true);

    // variants of the same brand McCafe in Russian tweets
    String[] synonyms = {"МсCafe", "maccafe", "маккафе", "\"мак кафе\"",
            "маккафэ", "\"мак кафэ\""};

    // each synonym is wrapped in its own single-element list
    List<List<String>> synonymsList = new ArrayList<List<String>>();
    for (String synonym : synonyms) {
        List<String> curSynonym = new ArrayList<String>();
        curSynonym.add(synonym);
        synonymsList.add(curSynonym);
    }

    // tweet message: "We were in McCafe today! Unbelievably tasty cakes,
    // but damn, they are so big!!"
    SynonymSentiment synonymSentiment = sentimentEngine.detectPolarityOfTextForSynonyms(
            "ох сегодня были в МакКафе! безумно вкусные пирожные, но блии н они ж гиганские!!",
            synonymsList);

    // the brand must be found in the tweet and the sentiment towards it must be positive
    assertEquals(true, synonymSentiment.isSynonymFound());
    assertEquals(Enumerations.Sentiment.POSITIVE,
            synonymSentiment.getSentimentTag());
}

This test case should pass, i.e. the detected sentiment for a set of object synonyms is going to be POSITIVE.