
Atalaya at TASS 2018: Sentiment Analysis with Tweet Embeddings and Data Augmentation

Atalaya en TASS 2018: Análisis de Sentimiento con Embeddings de Tweets y Aumentación de Datos

Franco M. Luque (1), Juan Manuel Pérez (2)

(1) Universidad Nacional de Córdoba & CONICET
(2) Universidad de Buenos Aires & CONICET

[email protected], [email protected]

Resumen: El workshop TASS 2018 propone diferentes desafíos de análisis semántico del Español. Este trabajo presenta nuestra participación con el equipo Atalaya en la tarea de clasificación de polaridad de tweets. Seguimos técnicas estándar de preprocesamiento, representación y clasificación, y también exploramos algunas ideas novedosas. En particular, para obtener embeddings de tweets entrenamos word embeddings con información de subpalabras, y usamos un esquema de pesaje para promediarlos. Para lidiar con problemas de sobreajuste causados por la escasez de datos de entrenamiento, probamos una estrategia de aumentación de datos basada en traducción automática bidireccional. Experimentos con clasificadores lineales y modelos neuronales muestran resultados competitivos para las diferentes subtareas propuestas en el desafío.
Palabras clave: Análisis de Sentimiento, Clasificación de Polaridad, Embeddings, Aumentación de Datos, Modelos Lineales, Redes Neuronales

Abstract: The TASS 2018 workshop proposes different challenges on semantic analysis in Spanish. This work presents our participation as team Atalaya in the task of polarity classification of tweets. We followed standard techniques in preprocessing, representation and classification, and also explored some novel ideas. In particular, to obtain tweet embeddings we trained subword-aware word embeddings and used a weighting scheme to average them. To deal with overfitting problems caused by training data scarcity, we tried a data augmentation strategy based on two-way machine translation. Experiments with linear classifiers and neural models show competitive results for the different subtasks proposed in the challenge.
Keywords: Sentiment Analysis, Polarity Classification, Embeddings, Data Augmentation, Linear Models, Neural Networks

1 Introduction

Every year, the TASS workshop presents different challenges related to sentiment analysis in Spanish. One of the main tasks is polarity classification of tweets and tweet aspects. In particular, task 1 of TASS 2018 (Martínez-Cámara et al., 2018) proposes polarity classification on tweet datasets from three different Spanish-speaking countries: Spain (ES), Costa Rica (CR) and Perú (PE). This article describes our participation in TASS 2018 task 1 with team Atalaya. We present polarity classification systems using standard techniques and propose improvements based on an iterative experimental development process. We tried different approaches for tweet preprocessing, vector representation and polarity classification models. Standard preprocessing techniques were used, including text simplification, stopword filtering, lemmatization and negation handling. Tweets were represented with bag-of-words, bag-of-characters, tweet embeddings, and combinations of these. As classification models, we considered linear classifiers and neural networks.

We used fastText subword-aware word vectors trained on tweet datasets specifically prepared for the task. Tweet vectors were computed from word vectors using a weighted averaging scheme, with weights inversely proportional to word frequency.

To cope with the scarcity of training data, we experimented with a data augmentation trick based on translation of training data to other languages and back to Spanish.

Embedding weighting and data augmentation represent novel approaches in the context of TASS. In experiments, both ideas showed improvements in prediction quality for some configurations.

The rest of the paper is organized as follows: the next section describes the main techniques and resources we tried; section 3 presents the experimental development of the systems, describing the explored configurations and the final model selection; and section 4 summarizes our final results for the competition and addresses conclusions and future work.

2 Techniques and Resources

This section describes the main techniques and resources we used to define the basic building blocks of our systems.

2.1 Preprocessing

Preprocessing is crucial in NLP applications, especially when working with noisy user-generated data.

We divided preprocessing into a two-stage process: first, we defined basic tweet preprocessing, using well-known standard and general-purpose techniques; then, we defined sentiment-oriented preprocessing, using techniques that try to emphasize semantic information.

Basic tweet preprocessing includes the following steps (a minimal sketch follows the list):

• Tokenization using the NLTK tweet tokenizer (Bird and Loper, 2004).

• Replacement of handles with the token '@USER', URLs with 'URL', and e-mails with '[email protected]'.

• Replacement of four or more repeatedletters with three letters.
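As an illustration, the following is a minimal sketch of these basic steps using the NLTK tweet tokenizer and simple regular expressions; the exact patterns and the e-mail placeholder token are assumptions of the sketch, not necessarily those of our system.

import re
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer()

def basic_preprocess(tweet):
    # Replace URLs, e-mails and handles with placeholder tokens
    # (the 'EMAIL' placeholder is an assumption of this sketch).
    tweet = re.sub(r'https?://\S+|www\.\S+', 'URL', tweet)
    tweet = re.sub(r'\S+@\S+\.\S+', 'EMAIL', tweet)
    tweet = re.sub(r'@\w+', '@USER', tweet)
    # Collapse four or more repeated letters into three.
    tweet = re.sub(r'(\w)\1{3,}', r'\1\1\1', tweet)
    return tokenizer.tokenize(tweet)

print(basic_preprocess('Holaaaaa @amigo mira http://t.co/abc !!!'))
# ['Holaaa', '@USER', 'mira', 'URL', '!', '!', '!']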

Sentiment-oriented preprocessing includes the following additional steps:

• Lowercasing.

• Removal of stopwords, using the NLTK Spanish stopword list.

• Removal of numbers.

• Lemmatization using TreeTagger (Schmid, 1995).

• Simple negation handling: we find negation words and add the prefix 'NOT_' to the following tokens. Up to three tokens are negated, or fewer if a non-word token is found (Das et al., 2001; Pang, Lee, and Vaithyanathan, 2002). A sketch of this step is given below.

• Removal of punctuation.

• Removal of consecutive repetitions of handles and URLs.

No treatment was applied to hashtags, emojis, interjections and onomatopoeias. Moreover, no spelling correction or any other additional normalization was applied.
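A minimal sketch of the negation handling step is given below; the list of negation words and the test for non-word tokens are illustrative assumptions, not the exact ones used in our system.

NEGATION_WORDS = {'no', 'ni', 'nunca', 'jamás', 'tampoco', 'nada', 'nadie'}  # illustrative list

def handle_negation(tokens, scope=3):
    # Prefix up to `scope` tokens after a negation word with 'NOT_'.
    # Negation stops early if a non-word token (e.g. punctuation) is found.
    result = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        result.append(token)
        if token.lower() in NEGATION_WORDS:
            j = i + 1
            while j < len(tokens) and j <= i + scope and tokens[j].isalpha():
                result.append('NOT_' + tokens[j])
                j += 1
            i = j
        else:
            i += 1
    return result

print(handle_negation(['no', 'me', 'gusta', 'nada', ',', 'gracias']))
# ['no', 'NOT_me', 'NOT_gusta', 'NOT_nada', ',', 'gracias']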

2.2 Bags of Words and Characters

The simplest approach we considered to build tweet representations was the bag-of-words encoding. A bag-of-words (BOW) builds feature vectors with one feature per token seen in the training data. For a particular tweet, its BOW vector contains the number of occurrences of each token in the tweet. The resulting vectors are high-dimensional and sparse. Variations of BOWs include counting not only single tokens but also n-grams of tokens, binarizing counts, and limiting the number of features.

Character usage in tweets may also hold useful information for sentiment analysis. Some character n-grams, such as those capturing the presence and repetition of uppercase letters, emoticons and exclamation marks, may indicate a strong presence of sentiment of some kind, whereas others may indicate a more formal writing style, and therefore an absence of sentiment.

To capture this information, we considered a bag-of-characters (BOC) representation that encodes counts of character n-grams for some values of n. These vectors are computed from the original texts of tweets, with no preprocessing at all. BOCs have the same variants and parameters as BOWs.
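As a sketch of how such representations can be built, the snippet below uses scikit-learn's TfidfVectorizer with binary counts and illustrative n-gram ranges; our actual feature pipeline may differ in details.

from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# BOW over preprocessed tweets: binary presence of word 1- and 2-grams with TF-IDF weighting.
bow = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), binary=True)
# BOC over raw tweets: binary presence of character 1- to 3-grams.
boc = TfidfVectorizer(analyzer='char', ngram_range=(1, 3), binary=True)

preprocessed = ['gracias por la informacion', 'NOT_gustar nada']
raw = ['Gracias por la información!!!', 'No me gusta nada :(']

X = hstack([bow.fit_transform(preprocessed), boc.fit_transform(raw)])
print(X.shape)  # (2, number of word n-gram + character n-gram features)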

2.3 Word Embeddings

Word embeddings are low-dimensional dense vector representations of words (Mikolov et al., 2013). These representations encode syntactic and semantic relations between words, useful for NLP tasks, and they can be learned in an unsupervised fashion using large quantities of plain text, providing high vocabulary coverage. When precomputed embeddings are used as features in supervised tasks, they provide robust information for words that are rare or unseen in training data. This is particularly useful when training data is scarce, as in this competition.

Recent work on embeddings introduced the usage of subword information to compute word vectors. Informative representations for out-of-vocabulary (OOV) words can be obtained from subword embeddings. OOV words are an important issue when working with highly noisy data such as user-generated data in social networks. Here, the need for text normalization in preprocessing can be alleviated with subword-based embeddings.

In our work, we used the fastText subword-based embeddings library (Bojanowski et al., 2016). Instead of using pretrained vectors, we decided to train our own embeddings on Twitter data.

To address the multilingual character of the challenge, we first collected a database of ~90 million tweets from various Spanish-speaking countries, including the ones concerned by the challenge. Then, we prepared two versions of the data, one using only basic preprocessing, and the other using sentiment-oriented preprocessing (excluding only lemmatization). For these two datasets, we trained skipgram embeddings using different parameter configurations, including the number of dimensions, the sizes of word and subword n-grams, and the size of the context window.
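The following is a sketch of how such embeddings can be trained with the fastText Python bindings; the file name and parameter values are illustrative, not the exact configurations we searched over.

import fasttext

# 'tweets.txt' is assumed to contain one preprocessed tweet per line.
model = fasttext.train_unsupervised(
    'tweets.txt',
    model='skipgram',
    dim=50,   # embedding dimension
    ws=5,     # context window size
    minn=3,   # minimum length of subword character n-grams
    maxn=6,   # maximum length of subword character n-grams
)

# Subword information yields vectors even for out-of-vocabulary words.
vector = model.get_word_vector('holaaa')
model.save_model('tweets_es.bin')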

2.4 Tweet Embeddings

There are a number of ways of using word embeddings for sentiment analysis on tweets: approaches go from simple averaging of the vectors of each word in the tweet, to the use of more complex architectures such as CNNs or RNNs. In this work, we used averaging to compute a single tweet embedding of the same dimensionality as the original word embeddings. We followed two simple approaches: plain averaging and weighted averaging. For weighted averaging, we used a scheme that resembles Smooth Inverse Frequency (SIF) (Arora, Liang, and Ma, 2017), inspired by TF-IDF reweighting. Each word w is weighted with a / (a + p(w)), where p(w) is the word unigram probability, and a is a smoothing hyper-parameter. Larger values of a mean more smoothing towards plain averaging.

We also considered two options that affect tweet embeddings: binarization, which ignores token repetitions in tweets; and normalization, which scales the resulting tweet vectors to have unit norm.
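A minimal sketch of the weighted averaging in plain numpy is shown below, assuming word vectors and unigram probabilities are available as dictionaries; binarization and normalization are included as options.

import numpy as np

def tweet_embedding(tokens, vectors, p, a=0.1, binarize=True, normalize=True):
    # Weighted average of word vectors with SIF-like weights a / (a + p(w)).
    if binarize:
        tokens = set(tokens)  # ignore token repetitions in the tweet
    vecs = [vectors[w] * (a / (a + p.get(w, 0.0))) for w in tokens if w in vectors]
    if not vecs:
        return np.zeros(len(next(iter(vectors.values()))))
    emb = np.mean(vecs, axis=0)
    if normalize:
        norm = np.linalg.norm(emb)
        if norm > 0:
            emb = emb / norm
    return emb

# Toy example with 3-dimensional vectors and made-up unigram probabilities.
vectors = {'gracias': np.array([1.0, 0.0, 0.0]), 'información': np.array([0.0, 1.0, 0.0])}
p = {'gracias': 0.01, 'información': 0.001}
print(tweet_embedding(['gracias', 'gracias', 'información'], vectors, p))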

2.5 Data Augmentation

As the amount of training instances was small, we paid special attention to model regularization. A technique used to address this is data augmentation, which consists of creating new synthetic instances out of real ones by applying label-preserving transformations. This overfitting-reduction strategy is widely used in Computer Vision (Krizhevsky, Sutskever, and Hinton, 2012; Simard, Steinkraus, and Platt, 2003) and Speech Recognition (Jaitly and Hinton, 2013; Ko et al., 2015). For instance, images can be zoomed, cropped, rotated, etc., while keeping the objects in them recognizable.

Data augmentation in NLP is a more subtle problem: there are no straightforward invariant transformations like those in Computer Vision. A common technique (Zhang, Zhao, and LeCun, 2015) is to replace words with synonyms using a thesaurus.

In this work we adopted a novel technique successfully used in a recent Kaggle NLP competition (https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/discussion/52557). The technique consists of translating the texts to a different language, and then translating them back to the original one. This process results in tweets that vary lexically and syntactically, while mostly keeping their meaning.

The tool selected to do this work was Google Translate, and the languages used as intermediates were English, French, Portuguese and Arabic. We discarded other options (e.g. Mandarin Chinese) as they greatly altered the meaning of tweets. Table 1 displays examples of tweets and the resulting artificial instances.
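The sketch below illustrates the augmentation procedure; translate(text, src, dest) is a hypothetical stand-in for whatever machine translation service is used (Google Translate in our case), not an actual API call.

PIVOT_LANGUAGES = ['en', 'fr', 'pt', 'ar']  # English, French, Portuguese, Arabic

def translate(text, src, dest):
    # Hypothetical wrapper around a machine translation service.
    raise NotImplementedError

def augment(tweet, label):
    # Generate label-preserving synthetic instances by two-way translation.
    augmented = []
    for lang in PIVOT_LANGUAGES:
        pivot = translate(tweet, src='es', dest=lang)
        back = translate(pivot, src=lang, dest='es')
        if back != tweet:  # keep only instances that actually vary
            augmented.append((back, label))
    return augmented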

3 Systems Development

This section describes the polarity classification systems we developed using the tools introduced in the previous section.

We worked on two types of classifiers: linear classifiers and neural networks. In both cases, we performed some kind of model selection, at times using the development set as the optimization target, and at other times using cross-validation on the combination of the train and development sets.



Original: Gracias por la información. Parece que el olor ha cesado. Ayer pasó lo mismo sobre la misma hora
Augmented:
- Gracias por la información. Parece que el olor se ha detenido. Ayer sucedió lo mismo al mismo tiempo
- Gracias por la información. Parece que el olor se ha detenido. Ayer, lo mismo ocurrió al mismo tiempo

Original: Muy buenas amigos! Como podemos contactar con ustedes
Augmented:
- ¡Muy buenos amigos! ¿Cómo podemos ponernos en contacto con usted?
- Muy buenos amigos! ¿Cómo podemos contactarlo?

Original: La verdad es que tiene buena pinta. Investigaré, gracias
Augmented:
- La verdad es que parece bueno. Voy a investigar, gracias
- La verdad es que se ve bien. Voy a investigar, gracias
- El hecho es que se ven bien. Lo comprobaré, gracias

Table 1: Data augmentation examples. Each original tweet is followed by the results of two-way translations through several intermediate languages.

The next subsections describe the experimental development and the best configurations we found for both types of systems.

3.1 Linear Classifiers

We first built a classification pipeline using simple linear models, such as logistic regressions and SVMs, implemented with scikit-learn (Pedregosa et al., 2011). Next, we describe the model selection process, which was done almost entirely using the InterTASS ES corpus.

As input features, we combined the three representations described in the previous section: bag-of-words, bag-of-characters and tweet embeddings.

For the bags of words and characters, early experiments showed a clear advantage of binary values over counts, together with TF-IDF re-weighting. First choices for n-gram ranges were (1, 2) for words and (1, 3) for characters.

For the embeddings, sentiment-oriented word vectors showed an advantage over basic vectors. We tried embeddings of dimensions 50, 100, 200 and 300. The best results were found with 50 dimensions, although the differences were not statistically significant.

To compute tweet embeddings, we tried basic averaging (as provided by fastText) and the weighted averaging scheme described in section 2.4. We experimented with smoothing values a = 10^n for n ∈ {−3, ..., 3}, resulting in a significant advantage for a = 0.1. Here, binarization and normalization as described in section 2.4 showed better results.

For the classifier, we tried logistic regressions (LRs) and linear-kernel SVMs. To alleviate the class imbalance problem, training items were weighted according to the inverse of the class frequency. Both LR and linear SVM hyper-parameters were selected targeting the optimization of accuracy and Macro-F1 over the InterTASS ES development set. In particular, the best regularization parameters found were C = 1.0 for LRs, and C = 0.05 for SVMs. Logistic regressions were selected over SVMs as they performed consistently better in all experiments.

Model   BOW     BOC     M-F1    Acc.
LR      (1, 2)  (1, 3)  0.496   0.634
LR+DA   (1, 2)  (1, 3)  0.490   0.615
LR      (1, 5)  (1, 6)  0.493   0.634
LR+DA   (1, 5)  (1, 6)  0.529   0.648

Table 2: Experiments with logistic regressions (LR), showing the interaction of training data augmentation (DA) with n-gram size ranges for bags of words and characters (BOW and BOC, resp.). Results are on the InterTASS ES development set.
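The following sketch shows how such a linear model can be assembled with scikit-learn; it is our reconstruction of the configuration described above, not the original implementation, and the feature matrices are assumed to come from the components of sections 2.2 and 2.4.

from scipy.sparse import csr_matrix, hstack
from sklearn.linear_model import LogisticRegression

def combine_features(X_bow, X_boc, X_emb):
    # X_bow, X_boc: sparse BOW/BOC matrices; X_emb: dense matrix of tweet embeddings.
    return hstack([X_bow, X_boc, csr_matrix(X_emb)]).tocsr()

# Class weights inversely proportional to class frequency; C = 1.0 as selected above.
clf = LogisticRegression(C=1.0, class_weight='balanced', max_iter=1000)
# clf.fit(combine_features(Xb_tr, Xc_tr, Xe_tr), y_train)
# y_pred = clf.predict(combine_features(Xb_dev, Xc_dev, Xe_dev))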

When adding augmented data, first results showed a significant degradation in accuracy. However, an exploration of parameter values showed that data augmentation allowed an improvement in performance when the range of n-gram sizes considered for BOWs and BOCs was increased. The best results were found with up to 5-grams for words, and up to 6-grams for characters. Table 2 shows how data augmentation combined with bigger n-gram ranges improved results.

Most of the previous parameter selection was reviewed after data augmentation, confirming the selected values. We also tried adding training data from the General TASS corpus, only to find that this was harmful for our models. With the optimal models found in this process we submitted final results for the Spanish (ES) monolingual task.

For the Costa Rica (CR) and Perú (PE) monolingual tasks, the same values as for ES were used for most parameters. Only weighted averaging, data augmentation and n-gram ranges were explored. In the CR data, weighting improved results, with the peak at a = 0.5. Data augmentation was also helpful, with the best results using up to 4-grams for words and 6-grams for characters. In the PE data, neither weighting nor data augmentation were helpful. The best results were found using up to 2-grams for words and 5-grams for characters.

3.2 Multilayer Perceptron

In the second set of experiments we used multilayer perceptron (MLP) neural networks. MLPs performed well in previous editions of the challenge (Díaz-Galiano et al., 2018).

Fig. 1 displays the chosen architecture, consisting of two hidden layers and a softmax output. ReLU units were used as activation functions in the hidden layers. To avoid overfitting, we tried dropout (Srivastava et al., 2014) and early stopping.

Figure 1: Architecture of the MLP (InputLayer, dense_1: Dense, dropout_1: Dropout, dense_2: Dense, dropout_2: Dropout, Output: Softmax).

To find the best configurations, we performed random search (Bergstra and Bengio, 2012) using 5-fold cross-validation over the InterTASS ES training and development datasets. The explored configurations and hyperparameters were:

• BOW features: no BOW features at all, top-50, or top-150.

• Tweet embeddings: basic or weighted averaging.

• Hidden layers: different numbers of neurons and keep-probabilities (the probability of keeping the value of a neuron when training with dropout).

Task              Model   M-F1    Acc.
Mono ES           MLP     0.476   0.544
                  LR      0.468   0.599
Mono CR           MLP     0.451   0.562
                  LR      0.475   0.582
Mono PE           MLP     0.437   0.520
                  LR      0.462   0.451
Cross Lingual ES  MLP     0.441   0.485
Cross Lingual PE  MLP     0.438   0.523
Cross Lingual CR  MLP     0.453   0.565

Table 3: Submitted results for each subtask.

Results of this search showed that bag-of-words features and embedding weighting did not improve performance. Regarding the MLP architecture, we selected 256 as the size of the first layer and 128 for the second, with keep-probabilities of 0.25 and 0.55, respectively. This configuration was used in all subtasks.
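A sketch of this architecture in Keras is shown below; the input dimension and the number of output classes (the four InterTASS polarity labels) are assumptions of the sketch, and note that Keras Dropout takes a drop rate, i.e. one minus the keep-probability.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

INPUT_DIM = 50    # tweet-embedding dimension (assumption)
NUM_CLASSES = 4   # P, N, NEU, NONE (assumption)

model = Sequential([
    Dense(256, activation='relu', input_shape=(INPUT_DIM,)),
    Dropout(1 - 0.25),  # keep-probability 0.25 -> drop rate 0.75
    Dense(128, activation='relu'),
    Dropout(1 - 0.55),  # keep-probability 0.55 -> drop rate 0.45
    Dense(NUM_CLASSES, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Early stopping can be added via tf.keras.callbacks.EarlyStopping:
# model.fit(X_train, y_train, validation_split=0.1, epochs=50, batch_size=32)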

Data augmentation in combination with MLPs showed mixed results. For the monolingual ES subtask, using synthetic data resulted in a Macro-F1 gain, while for monolingual PE it degraded the results.

All monolingual models were trained using the respective train sections of the InterTASS datasets. General TASS was not used as it showed no improvements. For the cross-lingual tasks, models for each language were trained using the datasets of the two other languages.

We used Keras (Chollet and others, 2015) to implement the model and scikit-learn (Pedregosa et al., 2011) to perform the cross-validation.

4 Conclusions and Future Work

We presented our participation in TASS 2018 task 1 as team Atalaya. We explored standard approaches as well as some simple but original recent ideas, such as data augmentation and word embedding weighting. Table 3 displays results for each subtask. Our systems ranked among the first three in all the subtasks.

Experiments show that competitive results can be achieved without having to resort to complex neural architectures such as CNNs, RNNs, LSTMs, etc. Even simple logistic regressions were able to rank among the top performing systems.

Future work includes further exploration of data augmentation, tweet embedding techniques, and sentiment-oriented word embeddings. We also aim to improve preprocessing and to adopt modern neural classification models.

References

Arora, S., Y. Liang, and T. Ma. 2017. A simple but tough-to-beat baseline for sentence embeddings.

Bergstra, J. and Y. Bengio. 2012. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281-305.

Bird, S. and E. Loper. 2004. NLTK: the natural language toolkit. In Proceedings of the ACL 2004 on Interactive Poster and Demonstration Sessions, page 31. Association for Computational Linguistics.

Bojanowski, P., E. Grave, A. Joulin, and T. Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Chollet, F. et al. 2015. Keras.

Das, S. R., M. Y. Chen, T. V. Agarwal, C. Brooks, Y. Shee Chan, D. Gibson, D. Leinweber, A. Martinez-Jerez, P. Raghubir, S. Rajagopalan, A. Ranade, M. Rubinstein, and P. Tufano. 2001. Yahoo! for Amazon: Sentiment extraction from small talk on the web. In 8th Asia Pacific Finance Association Annual Conference.

Díaz-Galiano, M. C., E. Martínez-Cámara, M. Ángel García Cumbreras, M. G. Vega, and J. V. Román. 2018. The democratization of deep learning in TASS 2017. Procesamiento del Lenguaje Natural, 60(0):37-44.

Jaitly, N. and G. E. Hinton. 2013. Vocal tract length perturbation (VTLP) improves speech recognition. In Proc. ICML Workshop on Deep Learning for Audio, Speech and Language, volume 117.

Ko, T., V. Peddinti, D. Povey, and S. Khudanpur. 2015. Audio augmentation for speech recognition. In Sixteenth Annual Conference of the International Speech Communication Association.

Krizhevsky, A., I. Sutskever, and G. E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105.

Martínez-Cámara, E., Y. Almeida Cruz, M. C. Díaz-Galiano, S. Estévez Velarde, M. A. García-Cumbreras, M. García-Vega, Y. Gutiérrez Vázquez, A. Montejo Ráez, A. Montoyo Guijarro, R. Muñoz Guillena, A. Piad Morffis, and J. Villena-Román. 2018. Overview of TASS 2018: Opinions, health and emotions. In E. Martínez-Cámara, Y. Almeida Cruz, M. C. Díaz-Galiano, S. Estévez Velarde, M. A. García-Cumbreras, M. García-Vega, Y. Gutiérrez Vázquez, A. Montejo Ráez, A. Montoyo Guijarro, R. Muñoz Guillena, A. Piad Morffis, and J. Villena-Román, editors, Proceedings of TASS 2018: Workshop on Semantic Analysis at SEPLN (TASS 2018), volume 2172 of CEUR Workshop Proceedings, Sevilla, Spain, September. CEUR-WS.

Mikolov, T., I. Sutskever, K. Chen, G. Corrado, and J. Dean. 2013. Distributed representations of words and phrases and their compositionality. CoRR, abs/1310.4546.

Pang, B., L. Lee, and S. Vaithyanathan. 2002. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing, pages 79-86. Association for Computational Linguistics, July.

Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830.

Schmid, H. 1995. Improvements in part-of-speech tagging with an application to German. In Proceedings of the ACL SIGDAT-Workshop, pages 47-50.

Simard, P. Y., D. Steinkraus, and J. C. Platt. 2003. Best practices for convolutional neural networks applied to visual document analysis. In Seventh International Conference on Document Analysis and Recognition (ICDAR), page 958. IEEE.

Srivastava, N., G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929-1958.

Zhang, X., J. Zhao, and Y. LeCun. 2015. Character-level convolutional networks for text classification. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 649-657. Curran Associates, Inc.
