
Towards Multilingual Automated Classification Systems (aibek.cs.ua.edu/files/Musaev_icdcs17_max.pdf, 2017-03-30)



Towards Multilingual Automated Classification Systems

Aibek Musaev¹ and Calton Pu²

¹Department of Computer Science, University of Alabama
²School of Computer Science, Georgia Institute of Technology

Abstract—In this paper we propose and evaluate three approaches for automated classification of texts in over 60 languages without the need for a manually annotated dataset in those languages. All approaches are based on the randomized Explicit Semantic Analysis method using multilingual Wikipedia articles as their knowledge repository. We evaluate the proposed approaches by classifying a Twitter dataset in English and Portuguese into relevant and irrelevant items with respect to landslide as a natural disaster, where the highest achieved F1-score is 0.93. These approaches can be used in various applications where multilingual classification is needed, including multilingual disaster reporting using Social Media to improve coverage and increase confidence. As an illustration, we present a demonstration that combines data from physical sensors and social networks to detect landslide events reported in English and Portuguese.

I. INTRODUCTION

In a supervised learning setting, human labels are necessary, but they may be costly to obtain [1]. This may be especially problematic in situations that attract a multilingual population, for example during disasters. In this paper, we study a specific example of such situations, namely disaster reporting based on Social Media, as the use of Social Media rises during disasters [2]. Note, however, that the data from Social Media often contain noise due to the ambiguity of the search keywords used to collect them [3].

The problem domain that we study is the classification of data collected from Twitter into items that are relevant to landslide as a natural disaster and items that are irrelevant. The following are frequent examples of tweets in English that describe landslide events involving the movement of soil:

• Heavy rains caused a landslide that destroyed houses in Taiz. 15 people from 7 families died.

• JUST IN: Mesa County increases alert level at massive Collbran landslide site: http://dpo.st/24bgWSm

However, one of the challenges of using keywords to collect data from Social Media is that they are often polysemous words, which may have one or more irrelevant meanings. The following are frequent examples of irrelevant topics involving the use of the word landslide:

• Elections are a numbers game. 2016 will be the biggest landslide since Reagan beat Carter. Make America Great Again.

• Is this the real life? Is this just fantasy? Caught in a landslide, No escape from reality...

Similarly, the keywords used to collect data from Social Media in Portuguese on landslide events are also polysemous words, as shown in Table I.

In this paper we propose three approaches for automated classification of texts, namely multilingual concepts, offline translation, and online translation. For evaluation purposes we use a Twitter dataset collected using landslide keywords in English and Portuguese. Finally, we present a demonstration that displays the landslide data from multiple sources on a web map, shown in Figure 1.

II. FRAMEWORK OVERVIEW

In this section, we present three approaches, namely Multilingual Concepts, Offline Translation, and Online Translation, for automated classification of texts in virtually any language. Each approach is based on a randomized ESA classification system, which reduces the dimensionality of the Explicit Semantic Analysis (ESA) method to speed up text classification. We provide an overview of the ESA method, followed by a description of the randomized ESA classification system. Finally, we describe the proposed approaches, which further extend the ESA method to support automated classification of texts in any language supported by Wikipedia.

A. ESA Overview

Explicit Semantic Analysis (ESA) is a powerful method for representing the meaning of texts in a high-dimensional space of Wikipedia articles [4]. Given a text fragment, for example, “Bernanke takes charge”, ESA generates the following top articles that are highly relevant to the input: Ben Bernanke, Federal Reserve, Chairman of the Federal Reserve, Alan Greenspan (Bernanke’s predecessor), Monetarism (an economic theory of money supply and central banking), inflation, and deflation [5].

ESA splits the text into words, and each word is mapped to a high-dimensional real-valued vector. In this vector space, each dimension corresponds to an article in Wikipedia. The value for each dimension is computed as the strength of association between the word and the corresponding Wikipedia article; one way to define this association strength is the common TF-IDF approach. Finally, the centroid of the vectors representing the individual words is computed.
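The per-word association and centroid steps can be sketched with a toy index. The three articles and their TF-IDF weights below are invented for illustration; a real ESA interpreter builds this index from the full Wikipedia corpus:

```python
import numpy as np

# Toy ESA interpreter: each dimension is one Wikipedia article, and the
# weight is a hypothetical TF-IDF association between word and article.
# Dimensions: ["Landslide", "Election", "Rain"]
word_vectors = {
    "heavy":     np.array([0.2, 0.0, 0.5]),
    "rains":     np.array([0.4, 0.0, 0.9]),
    "landslide": np.array([0.9, 0.3, 0.1]),
}

def esa_vector(text: str) -> np.ndarray:
    """Map a text into concept space: look up per-word vectors, then centroid."""
    vecs = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    if not vecs:
        return np.zeros(3)
    return np.mean(vecs, axis=0)  # centroid of the individual word vectors

v = esa_vector("Heavy rains landslide")  # array([0.5, 0.1, 0.5])
```

In a real system the vocabulary and article set are orders of magnitude larger, which is exactly the scale problem that the randomized ESA sampling below addresses.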


Fig. 1. Landslide reporting in multiple languages

Tweet in Portuguese: Chuva provoca deslizamento de terra em Carapicuíba, na Grande SP.
Meaning in English: Rain causes landslide in Carapicuíba, São Paulo metropolitan area.

Tweet in Portuguese: Após deslizamento de encosta, mulher morre soterrada em Salvador.
Meaning in English: Woman dies buried after hillside sliding in Salvador.

Tweet in Portuguese: 17 pontos para entender o desabamento das Bolsas na China.
Meaning in English: 17 points to understand the stock market collapse in China.

Tweet in Portuguese: Landslide não sai da minha cabeça, até quando eu tenho que estudar o.o
Meaning in English: I can’t get the Landslide song out of my mind, even when I need to study.

TABLE I: EXAMPLES OF LANDSLIDE TWEETS IN PORTUGUESE

B. Randomized ESA

The large size of the Wikipedia repository makes it computationally infeasible to apply the ESA method for practical purposes due to its high dimensionality. That is why several studies have been conducted to understand or enhance ESA performance [6], [7], [8], [9]. Using a random sample of the Wikipedia repository as features for rapid text classification is proposed in [10]. The reduced size of the Wikipedia repository is computed based on the sample size for a proportion when sampling without replacement. Given Z-score = 1.96 for a 95% confidence level, N = 5,322,499, and ε = 0.02, the sample size n should be 2,400, where N is the number of articles in the English Wikipedia¹.
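The numbers above follow the standard sample-size formula for a proportion with finite population correction. As a sketch (the worst-case proportion p = 0.5 is a standard assumption not stated explicitly in the text):

```python
import math

def sample_size(z: float, epsilon: float, N: int, p: float = 0.5) -> int:
    """Sample size for estimating a proportion, sampling without replacement."""
    n0 = z**2 * p * (1 - p) / epsilon**2        # infinite-population sample size
    return math.ceil(n0 / (1 + (n0 - 1) / N))   # finite population correction

# Z = 1.96 (95% confidence), margin of error 0.02, N = English Wikipedia articles
n = sample_size(z=1.96, epsilon=0.02, N=5_322_499)
print(n)  # 2400
```

Because N is so much larger than the uncorrected size n0 ≈ 2401, the correction barely changes the result, which is why 2,400 articles suffice regardless of Wikipedia's continued growth.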

C. Proposed approaches

We propose three approaches for automated classification of texts in any language, namely Multilingual Concepts, Offline Translation, and Online Translation; see an overview in Table II.

¹http://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia

Multilingual Concepts. The ESA method represents the meaning of texts based on their association strength to Wikipedia articles. The Wikipedia articles have titles, which are the concepts described in the articles. These articles also frequently have versions in other languages; for example, the article on landslides has more than 60 versions in other languages². In other words, the same concept is represented in multiple languages, and we use this fact to build ESA matrices in different languages. Since these multilingual versions describe the same concept, our hypothesis is that a vector generated using the ESA method for a text discussing a particular concept in one language will have a similar pattern to a vector generated for a text discussing the same concept in another language. This means that, given an ESA matrix based on Wikipedia articles in one language (e.g., English), we can generate a corresponding ESA matrix in another language (e.g., Portuguese) using the Portuguese versions of the same articles, and apply the same classifier model that was built using the English-based ESA to classify data in Portuguese. There are two advantages to this approach: there is no need for manual annotation of a training dataset in another language, and there is no need to translate the training or evaluation datasets.

²https://en.wikipedia.org/wiki/Landslide

As shown in Table II, in the Multilingual Concepts approach we build the randomized ESA in the original language, e.g., English, and use it to generate vectors in the same language so that we can build a classifier model. Next we build the randomized ESA in the new language, e.g., Portuguese, apply it to generate vectors in Portuguese, and then reuse the classifier model built on English for classification of the vectors generated from Portuguese.
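The pipeline can be sketched end to end with toy data. All vectors below are invented, and a nearest-centroid rule stands in for the paper's trained classifiers; the point being illustrated is that dimension j means the same Wikipedia concept in both languages, so the English-trained model transfers unchanged:

```python
import numpy as np

# Hypothetical randomized-ESA vectors over two shared concepts
# ("Landslide", "Election"); each row is one English training tweet.
train_en = np.array([[0.9, 0.1], [0.8, 0.2],   # relevant  (label 1)
                     [0.1, 0.9], [0.2, 0.7]])  # irrelevant (label 0)
train_labels = np.array([1, 1, 0, 0])

# "Train" on the English vectors: per-class centroids stand in for the model.
centroids = {c: train_en[train_labels == c].mean(axis=0) for c in (0, 1)}

def classify(vec: np.ndarray) -> int:
    # Reuse the English-trained model on any vector in the shared concept space.
    return min(centroids, key=lambda c: np.linalg.norm(vec - centroids[c]))

# A Portuguese tweet vectorized with the Portuguese ESA matrix lives in the
# same concept space, so it is classified directly: no translation, no labels.
pt_vec = np.array([0.85, 0.2])
label = classify(pt_vec)  # 1 (relevant)
```

The design choice is that alignment happens once, at the level of the shared Wikipedia concepts, rather than per text at classification time.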

Offline Translation. In this approach, we propose to translate the training dataset from the original language, e.g., English, to the new language, such as Portuguese. Then we build the randomized ESA in the new language and use it to generate training vectors in the same language, so that we can build a classifier model. Finally, we use the built model for classification of vectors generated using the new language. Unlike the Multilingual Concepts approach, here we need to translate the training dataset from the original language to the new language. This is a one-time operation, which can be performed offline, so it should not affect the runtime performance of this approach. Again, there is no need for manual annotation of the training dataset in the new language.

Online Translation. In this approach, we propose to translate the evaluation dataset from the new language, e.g., Portuguese, to the original language, e.g., English. Then we can reuse the classifier model that was already built for the original language. The advantage of this approach is that we do not need to annotate a training dataset in the new language. However, each time we receive a new text for classification, such as an incoming tweet, we need to translate it into the original language. This can become expensive in terms of both performance and cost given a high rate of incoming tweets.
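The online path can be sketched as follows. Both functions below are trivial stand-ins (the paper actually uses the Microsoft Translator API and a classifier trained on English ESA vectors); what the sketch shows is that the translation call sits on the per-tweet critical path:

```python
# Stand-in for a paid, per-request translation service (hypothetical lookup).
def translate_pt_to_en(text: str) -> str:
    glossary = {"deslizamento de terra": "landslide"}
    return glossary.get(text.lower(), text)

# Stand-in for the classifier model trained on the original-language data.
def classify_en(text: str) -> int:
    return 1 if "landslide" in text.lower() else 0

def classify_online(pt_text: str) -> int:
    # Translation happens per incoming tweet, at classification time: this
    # recurring cost is what the other two approaches avoid.
    return classify_en(translate_pt_to_en(pt_text))
```

A high tweet rate multiplies this cost, whereas Offline Translation pays a one-time cost and Multilingual Concepts pays none.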

III. EXPERIMENTAL EVALUATION

A. Dataset description

For experimental evaluation, we consider the performance of the proposed approaches in a specific problem domain, namely the disambiguation of landslide data collected from Social Media.

We use the keywords landslide and mudslide to collect the data from Twitter in English and the keywords deslizamento and desabamento to collect the data in Portuguese. Each tweet is manually labeled as either relevant or irrelevant to landslide as a natural disaster. To mark items as relevant we use two approaches. First, we check whether a given text describes a landslide event confirmed by the authoritative source, namely USGS³. Otherwise, we search for a confirmation of the landslide event online by other trustworthy sources, using the described event as a search query.

Irrelevant items are much easier to label, as we only need to check whether a given item uses other meanings of the word landslide, such as an overwhelming victory in an election or an excerpt from song lyrics. See an overview of the dataset used in the experiments in Table III. We use 2,400 multilingual articles in Wikipedia, which have both English and Portuguese versions. In other words, they describe the same concepts, but in two different languages. Our training dataset consists of English tweets, and the evaluation dataset consists of Portuguese tweets only.

B. Experimental setup

In order to generate ESA matrices, we download 2,400 random articles from Wikipedia, such that each article has at least 100 words and has both English and Portuguese versions. We use the Microsoft Translator API⁴ to translate the training dataset from English to Portuguese and the evaluation dataset from Portuguese to English. For evaluation of classification algorithms, we use the Weka software package [11]. We select the SMO, Logistic Regression, Naive Bayes, J48, and Random Forest classifiers for our analysis. They represent different categories of classification algorithms, and we determine the algorithms that demonstrate the best performance for each of the proposed approaches.

C. Results

The performance of the Multilingual Concepts approach is shown in Figure 2. Note that Logistic Regression demonstrates the best performance among the selected classification algorithms based on its F1-score. The high value of the F1-score is mainly due to its strong recall, so if we can improve the precision of this algorithm, then the F1-score can also improve.
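The reasoning above rests on the F1-score being the harmonic mean of precision and recall, which is pulled toward the weaker of the two. The numbers below are illustrative only, not the paper's measurements:

```python
def f1(precision: float, recall: float) -> float:
    """F1-score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Illustrative only: strong recall with weaker precision already yields a
# high F1, and raising precision alone raises F1 further.
before = f1(precision=0.80, recall=0.98)  # ~0.881
after = f1(precision=0.90, recall=0.98)   # ~0.938
```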

The Offline Translation results shown in Figure 3 demonstrate uneven performance across the selected classification algorithms. For example, the Naive Bayes classifier has the worst F1-score among all the proposed approaches due to its very low recall. SMO is the best classification algorithm for the Online Translation approach, as can be seen in Figure 4.

We compare all three approaches in Figure 5 based on the best classification algorithm for each approach. Although the Online Translation approach has the best performance overall, the Multilingual Concepts approach demonstrates promising results with less computation, which warrants further analysis.

IV. DEMONSTRATION

A screenshot of the demonstration is provided in Figure 1. It shows the data on a web map, not only from Twitter but also from physical sensors and additional social networks related to landslide events. The data from physical sensors

³http://www.usgs.gov/
⁴https://msdn.microsoft.com/en-us/library/dd576287.aspx


Approach                Randomized ESA   Training Dataset   Evaluation Dataset
Multilingual Concepts   En, Pt           En                 Pt
Offline Translation     Pt               En → Pt            Pt
Online Translation      En               En                 Pt → En

TABLE II: OVERVIEW OF PROPOSED APPROACHES

Language     Wikipedia articles   Training dataset   Evaluation dataset
English      2,400 (shared)       3,689              N/A
Portuguese   2,400 (shared)       N/A                3,241

TABLE III: OVERVIEW OF DATASET


Fig. 2. Classification performance of Multilingual Concepts approach

Fig. 3. Classification performance of Offline Translation approach

Fig. 4. Classification performance of Online Translation approach

Fig. 5. Comparison of proposed approaches


include live feeds on earthquakes from USGS⁵ and on heavy rainfalls from TRMM⁶. There is also a static feed of the global landslide hazard distribution.

The data from social networks include live feeds from Twitter, YouTube, and Facebook. Users can select a language, such as Portuguese, to filter the feeds from the social networks by that language.

Users can view detailed information about items from each feed, e.g., they can read the contents of a tweet or watch a video from YouTube. They can also see all of the items that were used to make a decision regarding a landslide for each detected event provided in the Events feed.

Finally, users can toggle the checkbox for any feed to viewthe data from a particular feed or a combination of feeds.

V. CONCLUSION AND FUTURE WORK

In this paper we propose and evaluate three approaches for multilingual automated classification of texts in any language without the need for a training dataset in that language. We use multilingual Wikipedia articles to build ESA matrices that represent the meaning of any text as a weighted vector of Wikipedia-based concepts in different languages. With this approach, it is sufficient to have a single annotated training dataset in a language of choice to automatically support classification of texts in other languages. We successfully apply the proposed approaches to the disambiguation of landslide data from Social Media in English and Portuguese and display the results on a web map.

Our future work includes a more extensive evaluation of the proposed approaches based on larger datasets, and we will apply them to other problem domains and additional languages, such as Japanese and Chinese.

VI. ACKNOWLEDGMENTS

This research has been partially funded by the National Science Foundation through CISE’s SAVI/RCN (1402266, 1550379), CNS (1421561), CRISP (1541074), and SaTC (1564097) programs, an REU supplement (1545173), and gifts, grants, or contracts from Fujitsu, HP, Intel, and the Georgia Tech Foundation through the John P. Imlay, Jr. Chair endowment. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or the other funding agencies and companies mentioned above.

REFERENCES

[1] M. Imran, C. Castillo, F. Diaz, and S. Vieweg, “Processing social media messages in mass emergency: A survey,” ACM Comput. Surv., vol. 47, no. 4, pp. 67:1–67:38, 2015.

[2] J. D. Fraustino, B. Liu, and Y. Jin, “Social media use during disasters: A review of the knowledge base and gaps,” Final Report to Human Factors/Behavioral Sciences Division, Science and Technology Directorate, U.S. Department of Homeland Security. College Park, MD: START, 2012.

⁵http://earthquake.usgs.gov/earthquakes/feed/v1.0/geojson.php
⁶https://trmm.gsfc.nasa.gov/trmm_rain/Events/latest_3_day_landslide.html

[3] T. Sakaki, M. Okazaki, and Y. Matsuo, “Earthquake shakes Twitter users: real-time event detection by social sensors,” in Proceedings of the 19th International Conference on World Wide Web, WWW 2010, Raleigh, North Carolina, USA, April 26–30, 2010, pp. 851–860.

[4] E. Gabrilovich and S. Markovitch, “Computing semantic relatedness using Wikipedia-based explicit semantic analysis,” in IJCAI 2007, Proceedings of the 20th International Joint Conference on Artificial Intelligence, Hyderabad, India, January 6–12, 2007, pp. 1606–1611.

[5] E. Gabrilovich and S. Markovitch, “Wikipedia-based semantic interpretation for natural language processing,” J. Artif. Intell. Res. (JAIR), vol. 34, pp. 443–498, 2009.

[6] M. Anderka and B. Stein, “The ESA retrieval model revisited,” in Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2009, Boston, MA, USA, July 19–23, 2009, pp. 670–671.

[7] O. Egozi, E. Gabrilovich, and S. Markovitch, “Concept-based feature generation and selection for information retrieval,” in Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, AAAI 2008, Chicago, Illinois, USA, July 13–17, 2008, pp. 1132–1137.

[8] M. Potthast, B. Stein, and M. Anderka, “A Wikipedia-based multilingual retrieval model,” in Advances in Information Retrieval, 30th European Conference on IR Research, ECIR 2008, Glasgow, UK, March 30–April 3, 2008, pp. 522–530.

[9] P. Cimiano, A. Schultz, S. Sizov, P. Sorg, and S. Staab, “Explicit versus latent concept models for cross-language information retrieval,” in IJCAI 2009, Proceedings of the 21st International Joint Conference on Artificial Intelligence, Pasadena, California, USA, July 11–17, 2009, pp. 1513–1518.

[10] A. Musaev, D. Wang, S. Shridhar, and C. Pu, “Fast text classification using randomized explicit semantic analysis,” in 2015 IEEE International Conference on Information Reuse and Integration, IRI 2015, San Francisco, CA, USA, August 13–15, 2015, pp. 364–371.

[11] M. A. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, “The WEKA data mining software: an update,” SIGKDD Explorations, vol. 11, no. 1, pp. 10–18, 2009.