
Review

Text mining for market prediction: A systematic review

Arman Khadjeh Nassirtoussi a,*, Saeed Aghabozorgi a, Teh Ying Wah a, David Chek Ling Ngo b

a Department of Information Science, Faculty of Computer Science & Information Technology, University of Malaya, 50603 Kuala Lumpur, Malaysia
b Research & Higher Degrees, Sunway University, No. 5, Jalan University, Bandar Sunway, 46150 Petaling Jaya, Selangor DE, Malaysia

Expert Systems with Applications 41 (2014) 7653-7670
Contents lists available at ScienceDirect
journal homepage: www.elsevier.com/locate/eswa

a r t i c l e   i n f o

Article history:
Available online 9 June 2014

Keywords:
Online sentiment analysis
Social media text mining
News sentiment analysis
FOREX market prediction
Stock prediction based on news

a b s t r a c t

The quality of the interpretation of the sentiment in the online buzz in the social media and the online news can determine the predictability of financial markets and cause huge gains or losses. That is why a number of researchers have turned their full attention to the different aspects of this problem lately. However, there is no well-rounded theoretical and technical framework for approaching the problem to the best of our knowledge. We believe the existing lack of such clarity on the topic is due to its interdisciplinary nature that involves at its core both behavioral-economic topics as well as artificial intelligence. We dive deeper into the interdisciplinary nature and contribute to the formation of a clear frame of discussion. We review the related works that are about market prediction based on online-text-mining and produce a picture of the generic components that they all have. We, furthermore, compare each system with the rest and identify their main differentiating factors. Our comparative analysis of the systems expands onto the theoretical and technical foundations behind each. This work should help the research community to structure this emerging field and identify the exact aspects which require further research and are of special significance.

© 2014 Elsevier Ltd. All rights reserved.

* Corresponding author. Tel.: +49 1626230377. E-mail address: [email protected] (A. Khadjeh Nassirtoussi).
http://dx.doi.org/10.1016/j.eswa.2014.06.009
0957-4174/© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

The welfare of modern societies depends on their market economies. At the heart of any market economy lies the financial markets with their supply and demand equilibriums. Therefore, it is crucial to study markets and learn about their movements. Understanding market movements primarily facilitates one with the ability to predict future movements. Ability to predict in a market economy is equal to being able to generate wealth by avoiding financial losses and making financial gains. However, the nature of markets is such that they are extremely difficult to predict, if at all.

In general the predictive measures are divided into technical or fundamental analyses. They are differentiated based on their input data, with historic market data to be used for the former and any other kind of information or news about the country, society, company, etc. for the latter. Most of the research in the past has been done on technical analysis approaches, mainly due to the availability of quantitative historic market data and the general desire among traders for technical quantitative methods. Fundamental data is more challenging to use as input, especially when it is unstructured. Fundamental data may come from structured and numeric sources like macro-economic data or regular financial reports from banks and governments. Even this aspect of fundamental data has been rarely researched; but occasionally it has been demonstrated to be of predictive value, as in the works of Chatrath, Miao, Ramchander, and Villupuram (2014), Khadjeh Nassirtoussi, Ying Wah, and Ngo Chek Ling (2011) and Fasanghari and Montazer (2010). Nevertheless, fundamental data available in unstructured text is the most challenging research aspect and therefore is the focus of this work. Some examples would be the fundamental data available online in the textual information in social media, news, blogs, forums, etc. (see Fig. 1).

In this work a systematic review of the past works of research with significant contribution to the topic of online-text-mining for market-prediction has been conducted, leading to the clarification of today's cutting-edge research and its future directions. The main contributions of this work are:

1. The relevant fundamental economic and computer/information science concepts are reviewed and it is clarified how they are tied into the currently proposed solutions for this research problem.

2. The most significant past literature has been reviewed with an emphasis on the cutting-edge pieces of work.

3. The main differentiating factors among the current works are identified, and used to compare and contrast the available solutions.

4. Observations are made on the areas with lack of research which can constitute possible opportunities for future work.

The rest of this paper is structured as follows. Section 2 provides insight into the interdisciplinary nature of the research problem at hand and defines the required foundational concepts for a robust comprehension of the literature. Section 3 presents the review of the main available work. Section 4 makes suggestions for future research. And Section 5 concludes this work.

Fig. 1. Online text-mining for market-prediction. (Social media, news, blogs and forums feed a prediction system that emits a market movement call: up, down, steady, ...)

Fig. 2. Interdisciplinarity between linguistics, machine-learning and behavioral-economics.

2. A review of foundational interdisciplinary background concepts

Essentially what is being targeted by this research is the utilization of computational modeling and artificial intelligence in order to identify possible relationships between the textual information and the economy. That is the research problem.

In order to address this research problem adequately, at least three separate fields of study must be included, namely: linguistics (to understand the nature of language), machine-learning (to enable computational modeling and pattern recognition), and behavioral-economics (to establish economic sense) (see Fig. 2).

The main premise is aligned with recent findings of behavioral-economic principles whereby market conditions are products of the human behaviors involved (Bikas, Jureviciene, Dubinskas, & Novickyte, 2013; Tomer, 2007).

Our research has identified the below background topics as essential to develop a robust comprehension of this research problem:

2.1. Efficient Market Hypothesis (EMH)

The idea that markets are completely random and are not predictable is rooted in the efficient-market hypothesis (Fama, 1965), which asserts that financial markets are informationally efficient and that, consequently, one cannot consistently achieve returns in excess of average market returns on a risk-adjusted basis, given the information available at the time the investment is made. However, this absolute hypothesis is found not to be completely accurate, and Fama himself revises it to include 3 levels of efficiency: strong, semi-strong and weak (Fama, 1970). This indicates that there are many markets where predictability is plausible and viable, and such markets are termed weakly efficient. Market efficiency is correlated with information availability, and a market is only strongly efficient when all information is completely available, which realistically is rarely the case. Hence, he conceded that his theory is stronger in certain markets where the information is openly, widely and instantly available to all participants, and it gets weaker where such an assumption cannot be held concretely in a market.

2.2. Behavioral-economics

Cognitive and behavioral economists look at price as a purely perceived value rather than a derivative of the production cost. The media do not report market status only, but they actively create an impact on market dynamics based on the news they release (Robertson, Geva, & Wolff, 2006; Wisniewski & Lambe, 2013).

People's interpretation of the same information varies greatly. Market participants have cognitive biases such as overconfidence, overreaction, representative bias, information bias, and various other predictable human errors in reasoning and information processing (Friesen & Weller, 2006).

Behavioral finance and investor sentiment theory have firmly established that investors' behavior can be shaped by whether they feel optimistic (bullish) or pessimistic (bearish) about future market values (Bollen & Huina, 2011).

2.3. Adaptive market hypothesis (AMH)

The dilemma of market efficiency with regards to its degree and applicability to different markets is still a vibrant and ongoing topic of research with very contradictory results. For every paper producing empirical evidence supporting market efficiency, a contradictory paper can perhaps be found which empirically establishes market inefficiency (Majumder, 2013). Hence, a few years ago research produced a counter-theory by the name of the Adaptive Markets Hypothesis, in an effort to reconcile the Efficient Markets Hypothesis with Behavioral Finance (Lo, 2005). Urquhart and Hudson (2013) have conducted a comprehensive empirical investigation of the Adaptive Markets Hypothesis (AMH) in three of the most established stock markets in the world (the US, UK and Japanese markets), using very long-run data. Their research provides evidence for adaptive markets, with returns going through periods of independence and dependence, although the magnitude of dependence varies quite considerably. Thus the linear dependence of stock returns varies over time, but nonlinear dependence is strong throughout. Their overall results suggest that the AMH provides a better description of the behavior of stock returns than the Efficient Market Hypothesis.
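The finding that returns move through periods of dependence and independence can be illustrated with a rolling lag-1 autocorrelation of a return series. The following is only a toy sketch on synthetic random-walk returns, not the test battery used by Urquhart and Hudson; the function names and window length are our own illustrative choices.

```python
import random

def autocorr1(xs):
    """Lag-1 (first-order) autocorrelation of a return series."""
    n = len(xs)
    mean = sum(xs) / n
    num = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    den = sum((x - mean) ** 2 for x in xs)
    return num / den

def rolling_dependence(returns, window=60):
    """Lag-1 autocorrelation over rolling windows: values near zero in a
    period suggest weak-form efficiency there; persistent non-zero values
    suggest linear dependence, which the AMH allows to come and go."""
    return [autocorr1(returns[i:i + window])
            for i in range(len(returns) - window + 1)]

random.seed(0)
returns = [random.gauss(0.0, 0.01) for _ in range(500)]  # synthetic i.i.d. returns
coefs = rolling_dependence(returns)
print(min(coefs), max(coefs))  # small magnitudes, as expected for a random walk
```

On real market data the same rolling statistic would drift away from zero during "dependent" episodes, which is the qualitative pattern the AMH literature reports.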

2.4. Markets predictability

When markets are weakly efficient, then it must be possible to predict their behavior or at least determine criteria with predictive impact on them. However, the nature of markets is such that, once such information is available, they absorb it and adjust themselves, and therefore become efficient against the initial predictive criteria, thereby making them obsolete. Information absorption by markets and reaching new equilibriums constantly occur in markets, and some researchers have delved into modeling its dynamics and parameters under special circumstances (García & Urošević, 2013). Nonetheless, the existence of speculative economic bubbles indicates that market participants operate on irrational and emotional assumptions and do not pay enough attention to the actual underlying value. The research of Potì and Siddique (2013) indicates the existence of predictability in the foreign exchange market (FOREX) too. Although markets in the real world may not be absolutely efficient, research shows that some are more efficient than others; as mentioned, they can be strong, semi-strong and weak in terms of efficiency (Fama, 1970). It has been demonstrated that markets in emerging economies like Malaysia, Thailand, Indonesia, and the Philippines tend to be significantly less efficient compared to the developed economies, and hence predictive measures like technical trading rules seem to have more power (Yu, Nartea, Gan, & Yao, 2013). Additionally, the short-term variants of the technical trading rules have better predictive ability than long-term variants, and markets become more informationally efficient over time (Yu, Nartea, et al., 2013). Hence, there is a need to continuously revisit the level of efficiency of economically dynamic and rapidly growing emerging markets (Yu, Nartea, et al., 2013).
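The technical trading rules referred to above can be made concrete with the simplest of them, a moving-average crossover rule. This is a generic textbook sketch, not a rule taken from any of the cited studies; the window lengths and the toy price path are arbitrary illustrative values.

```python
def sma(prices, n):
    """Simple moving average over the last n prices at each point."""
    return [sum(prices[i - n + 1:i + 1]) / n for i in range(n - 1, len(prices))]

def ma_crossover_signals(prices, short=5, long=20):
    """Moving-average rule: 'buy' when the short-term average crosses
    above the long-term one, 'sell' on the opposite cross."""
    s, l = sma(prices, short), sma(prices, long)
    s = s[long - short:]  # align both series on the same price index
    signals = []
    for k in range(1, len(l)):
        if s[k] > l[k] and s[k - 1] <= l[k - 1]:
            signals.append((k + long - 1, "buy"))
        elif s[k] < l[k] and s[k - 1] >= l[k - 1]:
            signals.append((k + long - 1, "sell"))
    return signals

# A price path that falls and then rises produces a single buy cross.
prices = [100 - i for i in range(30)] + [70 + i for i in range(30)]
print(ma_crossover_signals(prices))
```

Note that such a rule only states that a pattern has occurred; as argued in the next section, it offers no explanation of why the pattern should have predictive power.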

2.5. Fundamental vs. technical analysis

Those who are convinced that markets can be, at least to some extent, predicted are segmented into two camps. Those who believe that historic market movements are bound to repeat themselves are known as technical analysts. They simply believe there are visual patterns in a market graph that an experienced eye can detect. Based on this belief many of the graph movements are named, which forms the basis of technical analysis. At a higher level, technical analysts try to detect such subtle mathematical models by the use of computation power and pattern recognition techniques. Although technical analysis techniques are most widespread among many of the market brokers and participants, to a scientific mind with a holistic point of view technical analysis alone cannot seem very attractive, especially because most technical analysts do not back any of their observations up with anything more than stating that patterns exist. They do very little, if anything at all, to find out the reason behind the existence of patterns. Some of the common techniques in technical analysis are the moving average rules, relative strength rules, filter rules and the trading range breakout rules. In a recent study the effectiveness and limitations of these rules were put to the test again, and it was demonstrated that in many cases and contexts these rules are not of much predictive power (Yu, Nartea, et al., 2013). Nevertheless, there are and continue to be many sophisticated financial prediction modeling efforts based on various types or combinations of machine learning algorithms like neural networks (Anastasakis & Mort, 2009; Ghazali, Hussain, & Liatsis, 2011; Sermpinis, Laws, Karathanasopoulos, & Dunis, 2012; Vanstone & Finnie, 2010), fuzzy logic (Bahrepour, Akbarzadeh-T, Yaghoobi, & Naghibi-S, 2011), support vector regression (Huang, Chuang, Wu, & Lai, 2010; Premanode & Toumazou, 2013), and rule-based genetic network programming (Mabu, Hirasawa, Obayashi, & Kuremoto, 2013).

However, there is a second school of thought, known as fundamental analysis, which seems to be more promising. In fundamental analysis, analysts look at fundamental data that is available to them from different sources and make assumptions based on that. We can come up with at least 5 main sources of fundamental data: (1) financial data of a company, like data in its balance sheet, or financial data about a currency in the FOREX market; (2) financial data about a market, like its index; (3) financial data about government activities and banks; (4) political circumstances; (5) geographical and meteorological circumstances like natural or unnatural disasters. However, determining the underlying fundamental value of any asset may be challenging and loaded with uncertainty (Kaltwasser, 2010). Therefore, the automation of fundamental analysis is rather rare. Fasanghari and Montazer (2010) design a fuzzy expert system for stock exchange portfolio recommendation that takes some of the company fundamentals through numeric metrics as input.

Most market participants try to keep an eye on both technical and fundamental data; however, fundamental data is usually of an unstructured nature, and it remains a challenge to make the best use of it efficiently through computing. The research challenge here is to deal with this unstructured data. One recent approach that is emerging in order to facilitate such topical unstructured data and extract structured data from it is the development of specialized search engines, like the financial news semantic search engine of Lupiani-Ruiz et al. (2011). However, it remains a challenge to extract meaning in a reliable fashion from text, and a search engine like the above is limited to extracting the available numeric data in the relevant texts. One recent study shows the impact of US news, UK news and Dutch news on three Dutch banks during the financial crisis of 2007-2009 (Kleinnijenhuis, Schultz, Oegema, & Atteveldt, 2013). This specific study goes on to explore market panics from a financial journalism perspective and communication theories, specifically in an age of algorithmic and frequency trading.

2.6. Algorithmic trading

Algorithmic trading refers to predictive mechanisms by intelligent robotic trading agents that are actively participating in every moment of market trades. The speed of such decision making has increased dramatically recently and has created the new term "frequency trading". Frequency trading has been more popular in the stock market, and Evans, Pappas, and Xhafa (2013) are among the first who are utilizing artificial neural networks and genetic algorithms to build an algo-trading model for intra-day foreign exchange speculation.

2.7. Sentiment and emotional analysis

Sentiment analysis deals with detecting the emotional sentiment preserved in text through specialized semantic analysis, for a variety of purposes: for instance, to gauge the quality of market reception for a new product and the overall customer feedback, or to estimate the popularity of a product or brand among people (Ghiassi, Skinner, & Zimbra, 2013; Mostafa, 2013). There is a body of research that is focused on sentiment analysis or so-called opinion mining (Balahur, Steinberger, Goot, Pouliquen, & Kabadjov, 2009; Cambria, Schuller, Yunqing, & Havasi, 2013; Hsinchun & Zimbra, 2010). It is mainly based on identifying positive and negative words and processing text with the purpose of classifying its emotional stance as positive or negative. An example of such a sentiment analysis effort is the work of Maks and Vossen (2012), which presents a lexicon model for deep sentiment analysis and opinion mining. A similar concept can be used in other areas of research, like emotion detection in suicide notes as done by Desmet and Hoste (2013).

However, an analysis of the emotional sentiment of news text can also be explored for market prediction. Schumaker, Zhang, Huang, and Chen (2012) have tried to evaluate the sentiment in financial news articles in relation to the stock market in their research, but have not been completely successful. A more successful recent example of this would be Yu, Wu, Chang, and Chu (2013), where a contextual entropy model was proposed to expand a set of seed words by discovering similar emotion words and their corresponding intensities from online stock market news articles. This was accomplished by calculating the similarity between the seed words and candidate words from their contextual distributions using an entropy measure. Once the seed words have been expanded, both the seed words and expanded words are used to classify the sentiment of the news articles. Their experimental results show that the use of the expanded emotion words improved classification performance, which was further improved by incorporating their corresponding intensities; this caused accuracy results to range from 52% to 91.5% by varying the difference of intensity levels between the positive and negative classes. It is also interesting to note that emotional analysis of text does not have to be merely based on positivity-negativity, and it can be done on other dimensions or on multi-dimensions (Ortigosa-Hernández et al., 2012). A recent piece of research by Loia and Senatore (2014) introduces a framework for extracting the emotions and the sentiments expressed in textual data. The sentiments are expressed by a positive or negative polarity. The emotions are based on Minsky's conception of emotions, which consists of four affective dimensions (Pleasantness, Attention, Sensitivity and Aptitude). Each dimension has six levels of activation, called sentic levels. Each level represents an emotional state of mind and can be more or less intense, depending on the position on the corresponding dimension. Another interesting development in sentiment analysis is the effort to go from the sentiment of chunks of text to specific features or aspects that are related to a concept or product: Kontopoulos, Berberidis, Dergiades, and Bassiliades (2013) propose a more efficient sentiment analysis of Twitter posts, whereby posts are not simply characterized by a sentiment score but instead receive a sentiment grade for each distinct notion in them, which is enabled by the help of an ontology. Li and Xu (2014) take another angle by looking for features that are meaningful to emotions, instead of simply choosing words with a high co-occurrence degree, and thereby create a text-based emotion classification using emotion cause extraction.

3. Review of the main available work

Despite the existence of multiple systems in this area of research, we have not found any dedicated and comprehensive comparative analysis and review of the available systems. Nikfarjam, Emadzadeh, and Muthaiyah (2010) have made an effort in the form of a conference paper that provides a rough summary under the title of "Text mining approaches for stock market prediction". Hagenau, Liebmann, and Neumann (2013) have also presented a basic comparative table in their literature review, which has been referred to in some parts of this work. In this section we are filling this gap by reviewing the major systems that have been developed around the past decade.

3.1. The generic overview

All of these systems have some of the components depicted in Fig. 3. At one end text is fed as input to the system, and at the other end some market-predictive values are generated as output. In the next sections a closer look is taken at each of the depicted components, their roles and theoretical foundation.
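The positive/negative word counting at the core of the opinion-mining approaches described in Section 2.7 can be sketched in a few lines. The tiny lexicon and the count-based scoring below are invented for illustration; the cited works use far larger curated lexicons and more sophisticated classifiers.

```python
# Illustrative word lists only; real systems use large curated lexicons.
POSITIVE = {"gain", "surge", "beat", "strong", "record", "optimistic"}
NEGATIVE = {"loss", "plunge", "miss", "weak", "warning", "pessimistic"}

def tokenize(text):
    """Lowercase and strip punctuation from each token."""
    return [w.strip(".,!?;:").lower() for w in text.split()]

def sentiment(text):
    """Classify a text as 'positive', 'negative' or 'neutral' by counting
    lexicon hits, the simplest form of sentiment classification."""
    tokens = tokenize(text)
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("Shares surge after record results"))
```

Approaches such as the contextual entropy model of Yu, Wu, Chang, and Chu (2013) improve on exactly this scheme by expanding the seed lexicon and weighting words by intensity rather than counting them equally.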

Fig. 3. Generic common system components diagram. (Textual data from social media, online news, blogs, forums, etc. and market data pass through dataset pre-processing, feature selection, dimensionality reduction, feature representation, news-market mapping (label assignment) and market data transformation, before model training and model evaluation in a machine-learning stage.)

    3.2. Input dataset

All systems take at least two sources of data as input, namely, the textual data from the online resources and the market data.

3.2.1. Textual data

The textual input can have several sources and content types accordingly, as shown in Table 1. The majority of the sources used are major news websites like The Wall Street Journal (Werner & Myrray, 2004), Financial Times (Wuthrich et al., 1998), Reuters (Pui Cheong Fung, Xu Yu, & Wai, 2003), Dow Jones, Bloomberg (Chatrath et al., 2014; Jin et al., 2013), Forbes (Rachlin, Last, Alberg, & Kandel, 2007) as well as Yahoo! Finance (Schumaker et al., 2012). The type of the news is either general news or special financial news. The majority of the systems use financial news, as it is deemed to have less noise compared with general news. What is being extracted here is the news text or the news headline. News headlines are occasionally used and are argued to be more straight-to-the-point and, hence, of less noise caused by verbose text (Huang, Liao, et al., 2010; Peramunetilleke & Wong, 2002). Fewer researchers have looked at less formal sources of textual information; e.g., Das and Chen (2007) have looked at the text on online message boards in their work. And more recently, Yu, Duan, and Cao (2013) have looked at the textual content from social media like Twitter and blog posts. Some researchers focus solely on Twitter and utilize it for market prediction and public mood analysis more efficiently (Bollen & Huina, 2011; Vu, Chang, Ha, & Collier, 2012). A third class of textual source for the systems has been the company annual reports, press releases and corporate disclosures. We observe that a difference about this class of information about companies is the nature of their timing, whereby regular reports and disclosures have prescheduled times. Pre-set timing is suspected to have an impact on prediction capability or nature, which may have been caused by anticipatory reactions among market participants (Chatrath et al., 2014; Huang, Liao, et al., 2010; Huang, Chuang, et al., 2010). Therefore, we have included a column for this information in the table. It is also important to remember that some textual reports can have structured or semi-structured formats, like macroeconomic announcements that come from governments or central banks on the unemployment rates or the Gross Domestic Product (GDP). Chatrath et al. (2014) have used such structured data to predict jumps in the foreign exchange market (FOREX). Furthermore, with regards to information release frequency, one interesting observation that has been made in the recent research is that an increase of the news release frequency can cause a decrease in informed trading, and hence the degree of information asymmetry is lower for firms with more frequent news releases (Sankaraguruswamy, Shen, & Yamada, 2013). This raises an interesting point whereby the uninformed traders increase and thereby play a significant role in the market with too frequent news releases, although it may feel counter-intuitive, as one may expect more informed trading to occur under such circumstances.

Table 1
Comparison of the textual input for different systems.

Reference | Text type | Text source | No. of items | Prescheduled | Unstructured
Wuthrich et al. (1998) | General news | The Wall Street Journal, Financial Times, Reuters, Dow Jones, Bloomberg | Not given | No | Yes
Peramunetilleke and Wong (2002) | Financial news | HFDF93 via www.olsen.ch | 40 headlines per hour | No | Yes
Pui Cheong Fung et al. (2003) | Company news | Reuters Market 3000 Extra | 600,000 | No | Yes
Werner and Myrray (2004) | Message postings | Yahoo! Finance, Raging Bull, Wall Street Journal | 1.5 million messages | No | Yes
Mittermayer (2004) | Financial news | Not mentioned | 6602 | No | Yes
Das and Chen (2007) | Message postings | Message boards | 145,110 messages | No | Yes
Soni et al. (2007) | Financial news | FT Intelligence (Financial Times online service) | 3493 | No | Yes
Zhai et al. (2007) | Market-sector news | Australian Financial Review | 148 direct company news and 68 indirect ones | No | Yes
Rachlin et al. (2007) | Financial news | Forbes.com, today.reuters.com | Not mentioned | No | Yes
Tetlock et al. (2008) | Financial news | Wall Street Journal, Dow Jones News Service from Factiva news database | 350,000 stories | No | Yes
Mahajan et al. (2008) | Financial news | Not mentioned | 700 news articles | No | Yes
Butler and Kešelj (2009) | Annual reports | Company websites | Not mentioned | Yes | Yes
Schumaker and Chen (2009) | Financial news | Yahoo Finance | 2800 | No | Yes
Li (2010) | Corporate filings | Management's Discussion and Analysis section of 10-K and 10-Q filings from SEC Edgar Web site | 13 million forward-looking statements in 140,000 10-Q and 10-K filings | Yes (company annual report) | Yes
Huang, Liao, et al. (2010) and Huang, Chuang, et al. (2010) | Financial news | Leading electronic newspapers in Taiwan | 12,830 headlines | No | Yes
Groth and Muntermann (2011) | Adhoc announcements | Corporate disclosures | 423 disclosures | No | Yes
Schumaker et al. (2012) | Financial news | Yahoo! Finance | 2802 | No | Yes
Lugmayr and Gossen (2012) | Broker newsletters | Brokers | Not available | No | Yes
Yu, Duan, et al. (2013) | Daily conventional and social media | Blogs, forums, news and micro blogs (e.g. Twitter) | 52,746 messages | No | Yes
Hagenau et al. (2013) | Corporate announcements and financial news | DGAP, EuroAdhoc | 10,870 and 3478 respectively | No | Yes
Jin et al. (2013) | General news | Bloomberg | 361,782 | No | Yes
Chatrath et al. (2014) | Macroeconomic news | Bloomberg | Not mentioned | Yes | No
Bollen and Huina (2011) | Tweets | Twitter | 9,853,498 | No | Yes
Vu et al. (2012) | Tweets | Twitter | 5,001,460 | No | Yes

3.2.2. Market data

The other source of input data for the systems comes from the numeric values in financial markets, in the form of price-points or indexes. This data is used mostly for the purpose of training the machine learning algorithms, and occasionally it is used for prediction purposes, whereby it is fed into the machine learning algorithm as an independent variable for a feature; this topic will be discussed in a later section. In Table 2, crucial details of such market data are provided. Firstly, there is a differentiation made between the stock markets and the foreign exchange market (FOREX). Past research has been mostly focused on stock market prediction, either in the form of a stock market index like the Dow Jones Industrial Average (Bollen & Huina, 2011; Werner & Myrray, 2004; Wuthrich et al., 1998), the US NASDAQ Index (Rachlin et al., 2007), the Morgan Stanley High-Tech Index (MSH) (Das & Chen, 2007), the Indian Sensex Index (Mahajan, Dey, & Haque, 2008), the S&P 500 (Schumaker & Chen, 2009), or the stock price of a specific company like BHP Billiton Ltd. (BHP.AX) (Zhai, Hsu, & Halgamuge, 2007), or like Apple, Google, Microsoft and Amazon in Vu et al. (2012), or a group of companies (Hagenau et al., 2013). The FOREX market has been only occasionally addressed, in about ten percent of the reviewed works; like in the work of Peramunetilleke and Wong (2002) and more recently in the works of Chatrath et al. (2014) and Jin et al. (2013). It is worth considering that the efficiency levels of the FOREX markets around the world vary (Wang, Xie, & Han, 2012), and hence it should be possible to find less efficient currency pairs that are prone to predictability.

Moreover, almost all the forecast types on any of the market measures above are categorical, with discrete values like Up, Down and Steady. There are very few pieces of research that have explored an approach based on linear regression (Jin et al., 2013; Schumaker et al., 2012; Tetlock, Saar-Tsechansky, & Macskassy, 2008).

Table 2
The input market data, experiment timeframe, length and forecast type.

Reference | Market | Market index | Time-frame | Period | Forecast type
Wuthrich et al. (1998) | Stocks | Dow Jones Industrial Average, the Nikkei 225, the Financial Times 100, the Hang Seng, and the Singapore Straits | Daily | 6 December 1997-6 March 1998 | Categorical: up, steady, down
Peramunetilleke and Wong (2002) | FOREX | Exchange rate (USD-DEM, USD-JPY) | Intraday (1, 2, or 3 h) | 22-27 September 1993 | Categorical: up, steady, down
Pui Cheong Fung et al. (2003) | Stocks | 33 stocks from the Hang Seng | Daily (no delay & varying lags) | 1 October 2002-30 April 2003 | Categorical: rise, drop
Werner and Myrray (2004) | Stocks | Dow Jones Industrial Average, and the Dow Jones Internet | Intraday (15-min, and 1 day) | Year 2000 | Categorical: buy, sell, and hold messages
Mittermayer (2004) | Stocks | Stock prices | Daily | 1 January-31 December 2002 | Categorical: good news, bad news, no movers
Das and Chen (2007) | Stocks | 24 tech-sector stocks in the Morgan Stanley High-Tech | Daily | July and August 2001 | Aggregate sentiment index
Soni et al. (2007) | Stocks | 11 oil and gas companies | Daily | 1 January 1995-15 May 2006 | Categorical: positive, negative
Zhai et al. (2007) | Stocks | BHP Billiton Ltd. from Australian Stock Exchange | Daily | 1 March 2005-31 May 2006 | Categorical: up, down
Rachlin et al. (2007) | Stocks | 5 stocks from US NASDAQ | Daily | ... | ...
Tetlock et al. (2008) | Stocks | Individual S&P 500 firms and their future cash flows | Daily | ... | ...
Mahajan et al. (2008) | Stocks | Sensex | Daily | ... | ...
Butler and Kešelj (2009) | Stocks | 1-Year market drift | Yearly | ... | ...
Schumaker and Chen (2009) | Stocks | S&P 500 stocks | Intraday | ... | ...
Li (2010) | Stocks | (1) Index (2) Quarterly earnings and cash flows (3) Stock returns | Annual (Qu...) | ... | ...
Huang, Liao, et al. (2010) and Huang, Chuang, et al. (2010) | Stocks | Taiwan Stock Exchange Financial Price | Daily | ... | ...
Groth and Muntermann (2011) | Stocks | Abnormal risk exposure ARISKs (Thomson Reuters DataScope Tick History) | Intraday (...) | ... | ...
Schumaker et al. (2012) | Stocks | S&P 500 | Intraday | ... | ...
Lugmayr and Gossen (2012) | Stocks | DAX 30 performance index | Intraday (...) | ... | ...
Yu, Duan, et al. (2013) | Stocks | Abnormal returns and cumulative abnormal returns of 824 firms | Daily | ... | ...
Hagenau et al. (2013) | Stocks | Company specific | Daily | ... | ...
Jin et al. (2013) | FOREX | Exchange rate | Daily | ... | ...
Chatrath et al. (2014) | FOREX | Exchange rate | Intraday | ... | ...
Bollen and Huina (2011) | Stock | DJIA | Daily | ... | ...
Vu et al. (2012) | Stock | Stock prices (at NASDAQ for AAPL, GOOG, MSFT, AMZN) | Daily | ... | ...

Furthermore, the predictive timeframe for each work is compared. The timeframe from the point of news release to a market impact observation can vary from seconds to days, weeks or months. The second or millisecond impact prediction is the element that fuels an entire industry by the name of micro-trading, in which special equipment and vicinity to the news source as well as the market computer systems is critical; a work on trading trends has well explained such quantitative tradings (Chordia, Roll, & Subrahmanyam, 2011). Another name for the same concept is High Frequency Trading, which is explored in detail by Chordia, Goyal, Lehmann, and Saar (2013). Another similar term is Low-Latency trading, which is amplified in the work of Hasbrouck and Saar (2013). However, past research has indicated sixty minutes to be looked at as a reasonable market convergence time to efficiency (Chordia, Roll, & Subrahmanyam, 2005). Market

    s 7 February7 May 2006 Categorical: up, slight-up,expected, slight-down, down

    ly 19802004 Regression of a measure calledneg.

    ly 5 August8 April Categoricalrly 20032008 Categorical: Over- or under-

    perform S&P 500 index overthe coming year

    aday (20 min) 26 October28 November2005

    Categorical: discrete numeric

    uallyarterly with 3my quarters)

    19942007 Categorical based on tone:positive, negative, neutral,uncertain

    ly JuneNovember 2005 Just signicance degreeassignment

    aday (volatilitying the s = 1530 min)

    1 August 200331 July 2005 Categorical: positive, negative

    aday (20 min) 26 October28 November2005

    Regression

    aday (3 or 4es per day)

    Not available Categorical: sentiment [1, 1];trend (bear, bull, neutral);trend strength (in %)

    ly 1 July30 September 2011 Categorical: positive ornegative

    ly 19972011 Categorical: positive ornegative

    ly 1 January31 December2012

    Regression

    aday(5 min) January 2005December2010

    Categorical: positive ornegative jumps

    ly 28 February19 December2008

    Regression analysis

    ly 1 April 201131 May 2011,online test 8 September26September 2012

    Categorical: up and down

Market convergence refers to the period during which a market reacts to newly available information and becomes efficient by reflecting it fully. Information availability and distribution channels are critical here, and research on market-efficiency convergence time is ongoing (Reboredo, Rivera-Castro, Miranda, & García-Rubio, 2013). Most of the works compared in Table 2 have a daily timeframe, followed by intraday timeframes ranging from 5 min (Chatrath et al., 2014), 15 min (Werner & Myrray, 2004) and 20 min (Schumaker & Chen, 2009) to 1, 2, or 3 h (Peramunetilleke & Wong, 2002). Experiment periods are also contrasted, the shortest being only 5 days in the work of Peramunetilleke and Wong (2002), and ranging up to multiple years, with the longest at 24 years from 1980 to 2004 by Tetlock (2007), followed by 14 years from 1997 to 2011 in the work of Hagenau et al. (2013) and 13 years from 1994 to 2007 in the work of Li (2010); the latter looks at an annual timeframe and the former two at daily timeframes. The remaining majority of the works take on an experiment period with a length of multiple months, as detailed in Table 2.

3.3. Pre-processing

Once the input data is available, it must be prepared so that it can be fed into a machine learning algorithm. For textual data this means transforming the unstructured text into a representative format that is structured and can be processed by the machine. In data mining in general, and in text mining specifically, the pre-processing phase has a significant impact on the overall outcomes (Uysal & Gunal, 2014). There are at least three sub-processes or aspects of pre-processing which we have contrasted across the reviewed works, namely: feature-selection, dimensionality-reduction and feature-representation.

3.3.1. Feature-selection
The decision on the features through which a piece of text is represented is crucial, because from an incorrect representational input nothing more than a meaningless output can be expected. Table 3 lists the type of feature selection from text for each of the works.

In most of the literature, only the most basic techniques have been used when dealing with text-mining-based market-prediction problems, as alluded to by Kleinnijenhuis et al. (2013) and consistent with our findings. The most common technique is the so-called bag-of-words, which essentially breaks the text up into its words and considers each of them a feature. As presented in Table 3, around three quarters of the works rely on this basic feature-selection technique, in which the order and co-occurrence of words are completely ignored. Schumaker et al. (2012) and Schumaker and Chen (2009) have explored two further techniques, namely noun-phrases and named entities. In the former, the words with a noun part-of-speech are first identified with the help of a lexicon, and noun-phrases are then detected using syntactic rules on the surrounding parts of speech. In the latter, a category system is added in which the nouns or noun-phrases are organized. They used the so-called MUC-7 framework of entity classification, whose categories include date, location, money, organization, percentage, person and time. However, they did not succeed in improving their noun-phrase results through this further categorization in the named-entities technique. On the other hand, Vu et al. (2012) successfully improve results by building a Named Entity Recognition (NER) system, based on a linear Conditional Random Fields (CRF) model, to identify whether a Tweet contains named entities related to their targeted companies. Another less frequently used but interesting technique is Latent Dirichlet Allocation (LDA), used by Jin et al. (2013) as well as Mahajan et al. (2008) in Table 3, to categorize words into concepts and use the representative concepts as the selected features. Some of the other works use a technique called n-grams (Butler & Kešelj, 2009; Hagenau et al., 2013). An n-gram is a contiguous sequence of n items, usually words, from a given sequence of text. However, word sequence and syntactic structures could essentially be exploited in more advanced ways; an example of such a setup is the use of syntactic n-grams (Sidorov, Velasquez, Stamatatos, Gelbukh, & Chanona-Hernández, 2013). Inclusion of such features may, however, give rise to language-dependency problems which need to be dealt with (Kim et al., 2014). Combination techniques of lexical and semantic features for short-text classification may also be enhanced (Yang, Li, Ding, & Li, 2013).

Feature selection is a standard step in the pre-processing phase of data mining, and there are many other approaches that can be considered for textual feature selection, genetic algorithms being one of them, as described in detail in the work of Tsai, Eberle, and Chu (2013). In another work, Ant Colony Optimization has been successfully used for textual feature-selection (Aghdam, Ghasem-Aghaee, & Basiri, 2009). Feature selection for text classification can also be done based on a filter-based probabilistic approach (Uysal & Gunal, 2012). Feng, Guo, Jing, and Hao (2012) propose a generative probabilistic model that describes categories by distributions and handles the feature-selection problem by introducing a binary exclusion/inclusion latent vector, which is updated via an efficient Metropolis search. Careful pre-processing has been shown to have a significant impact in similar text-mining problems (Haddi, Liu, & Shi, 2013; Uysal & Gunal, 2014).

3.3.2. Dimensionality-reduction
Having a limited number of features is extremely important, because an increase in the number of features, which can easily happen in feature-selection for text, can make the classification or clustering problem extremely hard to solve by decreasing the efficiency of most learning algorithms; this situation is widely known as the curse of dimensionality (Pestov, 2013). In Table 3, under dimensionality-reduction, the approach taken by each of the works is pointed out. Zhai et al. (2007) do this by choosing the top 30 concepts with the highest weights as the features instead of all available concepts. Mittermayer (2004) does this by filtering for the top 1000 terms out of all terms. The most common approach, however, is setting a minimum occurrence limit and reducing the terms by selecting the ones reaching a certain number of occurrences (Butler & Kešelj, 2009; Schumaker & Chen, 2009). The next most common approach is using a predefined dictionary of some sort to replace terms with a category name or value. Some of these dictionaries are specially put together by a market expert, like those used by Wuthrich et al. (1998) or Peramunetilleke and Wong (2002); others are specific to a particular field, like psychology in the case of the Harvard-IV-4 dictionary used in the work of Tetlock et al. (2008); and other times they are rather general-use dictionaries, like the WordNet thesaurus used by Zhai et al. (2007). At times a dictionary or thesaurus is created dynamically from the text corpus using a term-extraction tool (Soni, van Eck, & Kaymak, 2007). Another set of activities usually constitutes the minimum of dimensionality-reduction: stemming of features, conversion to lower-case letters, punctuation removal and removal of numbers, web-page addresses and stop-words. These steps are almost always taken, and in some works they are the only steps taken, as in the work by Pui Cheong Fung et al. (2003).

Feature-reduction can be enhanced in a number of ways, but the current research has not yet delved into it much, as observed in the reviewed works. Berka and Vajteršic (2013) introduce a detailed method for dimensionality reduction for text, based on parallel rare-term vector replacement.
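As an illustration of the basic steps discussed above, the following minimal Python sketch builds a bag-of-words or n-gram vocabulary with stop-word removal and a minimum-occurrence cut-off. It is a generic illustration only; the function names and the tiny stop-word list are assumptions, not taken from any of the reviewed systems.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (real systems use much larger ones).
STOP_WORDS = {"the", "a", "of", "to", "and", "in", "is", "on", "for"}

def tokenize(text):
    # Minimal pre-processing: lower-case, keep alphabetic tokens only
    # (drops punctuation and numbers), then remove stop-words.
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def ngrams(tokens, n):
    # Contiguous word n-grams; n=1 reduces to plain bag-of-words features.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_vocabulary(corpus, n=1, min_occurrence=2):
    # Keep only features that reach a minimum number of occurrences
    # across the corpus: a simple occurrence-based reduction step.
    counts = Counter()
    for doc in corpus:
        counts.update(ngrams(tokenize(doc), n))
    return sorted(f for f, c in counts.items() if c >= min_occurrence)
```

For example, `build_vocabulary(["Stocks rise on strong earnings", "Oil stocks rise", "Earnings fall"])` keeps only the words occurring at least twice across the three headlines.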

Table 3. Pre-processing: feature-selection, feature-reduction, feature-representation.

Reference | Feature selection | Dimensionality reduction | Feature representation
Wuthrich et al. (1998) | Bag-of-words | Pre-defined dictionaries (word-sequences by an expert) | Binary
Peramunetilleke and Wong (2002) | Bag-of-words | Set of keyword records | Boolean, TF-IDF, TF-CDF
Pui Cheong Fung et al. (2003) | Bag-of-words | Stemming, conversion to lower-case, removal of punctuation, numbers, web page addresses and stop-words | TF-IDF
Werner and Myrray (2004) | Bag-of-words | Minimum information criterion (top 1000 words) | Binary
Mittermayer (2004) | Bag-of-words | Selecting 1000 terms | TF-IDF
Das and Chen (2007) | Bag-of-words, triplets | Pre-defined dictionaries | Different discrete values for each classifier
Soni et al. (2007) | Visualization | Thesaurus made using the term extraction tool of N.J. van Eck | Visual coordinates
Zhai et al. (2007) | Bag-of-words | WordNet thesaurus (stop-word removal, POS tagging, higher-level concepts via WordNet); top 30 concepts | Binary, TF-IDF
Rachlin et al. (2007) | Bag-of-words, commonly used financial values | Most influential keywords list (automatic extraction) | TF, Boolean, Extractor software output
Tetlock et al. (2008) | Bag-of-words for negative words | Pre-defined dictionary: Harvard-IV-4 psychosocial dictionary | Frequency divided by total number of words
Mahajan et al. (2008) | Latent Dirichlet Allocation (LDA) | Extraction of twenty-five topics | Binary
Butler and Kešelj (2009) | Character n-grams, three readability scores, last year's performance | Minimum occurrence per document | Frequency of the n-gram in one profile
Schumaker and Chen (2009) | Bag-of-words, noun phrases, named entities | Minimum occurrence per document | Binary
Li (2010) | Bag-of-words, tone and content | Pre-defined dictionaries | Binary, dictionary value
Huang, Liao, et al. (2010) and Huang, Chuang, et al. (2010) | Simultaneous terms, ordered pairs | Synonyms replacement | Weighted based on the rise/fall ratio of the index
Groth and Muntermann (2011) | Bag-of-words | Feature-scoring methods using both Information Gain and Chi-squared metrics | TF-IDF
Schumaker et al. (2012) | OpinionFinder overall tone and polarity | Minimum occurrence per document | Binary
Lugmayr and Gossen (2012) | Bag-of-words | Stemming | Sentiment value
Yu, Duan, et al. (2013) | Bag-of-words | Not mentioned | Binary
Hagenau et al. (2013) | Bag-of-words, noun phrases, word-combinations, n-grams | Frequency for news; Chi2-approach and bi-normal separation (BNS) for exogenous-feedback-based feature selection; dictionary | TF-IDF
Jin et al. (2013) | Latent Dirichlet Allocation (LDA) | Topic extraction, top-topic identification by manually aligning news articles with currency fluctuations | Each article's topic distribution
Chatrath et al. (2014) | Structured data | Structured data | Structured data
Bollen and Huina (2011) | By OpinionFinder | By OpinionFinder | By OpinionFinder
Vu et al. (2012) | Daily aggregate number of positives or negatives on Twitter Sentiment Tool (TST) and an emoticon lexicon; daily mean of Pointwise Mutual Information (PMI) for pre-defined bullish-bearish anchor words | Pre-defined company-related keywords, Named Entity Recognition based on linear conditional random fields (CRF) | Real number for Daily Neg_Pos and Bullish_Bearish

3.3.3. Feature-representation
After the minimum number of features is determined, each feature needs to be represented by a numeric value so that it can be processed by machine learning algorithms. Hence, the title feature-representation is used in Table 3 for the column in which the type of numeric value associated with each feature is compared across the reviewed works. This assigned numeric value acts like a score or a weight. There are at least five very popular types for it, namely Information Gain (IG), Chi-square Statistics (CHI), Document Frequency (DF), Accuracy Balanced (Acc2) and Term Frequency-Inverse Document Frequency (TF-IDF). A comparison of these five metrics, along with proposals for some new metrics, can be found in the work of Taşcı and Güngör (2013).

The most basic technique is a Boolean or binary representation, whereby two values like 0 and 1 represent the absence or presence of a feature, e.g. a word in the case of a bag-of-words technique, as in these works (Mahajan et al., 2008; Schumaker et al., 2012; Wuthrich et al., 1998). The next most common technique is Term Frequency-Inverse Document Frequency, or TF-IDF (Groth & Muntermann, 2011; Hagenau et al., 2013; Peramunetilleke & Wong, 2002; Pui Cheong Fung et al., 2003). The TF-IDF value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, to balance out the general popularity of some words. There are also other similar measures that are occasionally used, like the Term Frequency-Category Discrimination (TF-CDF) metric, which is derived from Category Frequency (CF) and proves to be more effective than TF-IDF in the work of Peramunetilleke and Wong (2002).

Generally in text mining, enhanced feature-reduction (dimensionality-reduction) and feature weighting (feature-representation) can have a significant impact on the eventual text-classification efficiency (Shi, He, Liu, Zhang, & Song, 2011).

3.4. Machine learning

After pre-processing is completed and the text is transformed into a number of features with numeric representations, machine learning algorithms can be engaged. In the following, a brief summary of these algorithms is presented, as well as a comparison of some of the other technical details in the designs of the reviewed systems.
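The TF-IDF weighting discussed under feature-representation can be sketched in a few lines. This is one generic formulation (term frequency normalized by document length, idf = log(N/df)); several variants exist, and the function name here is an assumption for illustration only.

```python
import math
from collections import Counter

def tf_idf(corpus):
    # corpus: list of token lists. Returns one {term: weight} map per
    # document. The weight grows with the term's in-document frequency
    # and is offset by how many documents of the corpus contain it.
    n_docs = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(doc))  # document frequency: count each term once per doc
    weighted = []
    for doc in corpus:
        tf = Counter(doc)
        weighted.append({t: (tf[t] / len(doc)) * math.log(n_docs / df[t])
                         for t in tf})
    return weighted
```

Note that a term occurring in every document receives weight zero (log of one), which is exactly the intended damping of generally popular words.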

3.4.1. Machine learning algorithms
In this section, it is attempted to provide a summary of the machine learning algorithms used in the reviewed works. It is noted that comparing such algorithms is not easy and is full of pitfalls (Salzberg, 1997). Therefore, the main objective is to report what is used, so as to help understand what is lacking and what may be possible for future research. Almost all the machine learning algorithms that have been used are classification algorithms, as listed in Table 4. Basically, the systems use the input data to learn to classify an output, usually the movement of the market, into classes such as Up, Down and Steady. However, there is also a group of works which use regression analysis, and not classification, for making predictions.

Table 4 categorizes the reviewed works based on their algorithm into 6 classes:

(A) Support Vector Machine (SVM).
(B) Regression Algorithms.
(C) Naïve Bayes.
(D) Decision Rules or Trees.
(E) Combinatory Algorithms.
(F) Multi-algorithm experiments.

(A) Support Vector Machine (SVM): this section of Table 4 contains the class of algorithms that the vast majority of the reviewed works use (Mittermayer, 2004; Pui Cheong Fung et al., 2003; Schumaker & Chen, 2009; Soni et al., 2007; Werner & Myrray, 2004). SVM is a non-probabilistic binary linear classifier used for supervised learning. The main idea of SVMs is finding a hyperplane that separates two classes with a maximum margin. The training problem in SVMs can be represented as a quadratic programming optimization problem. A common implementation of SVM that is used in many works (Mittermayer, 2004; Pui Cheong Fung et al., 2003) is SVM-light (Joachims, 1999), an implementation of an SVM learner which addresses the problem of large tasks (Joachims, 1999); the optimization algorithms used in SVM-light are described in Joachims (2002). Another commonly used implementation of SVM is LIBSVM (Chang & Lin, 2011), which implements an SMO-type (Sequential Minimal Optimization) algorithm proposed in a paper by Fan, Chen, and Lin (2005). Soni et al. (2007) use this implementation for the prediction of stock-price movements. Aside from the implementation, the input to the SVM is rather unique in the work of Soni et al. (2007): they take the coordinates of the news items in a visualized document-map as features. A document-map is a low-dimensional space in which each news item is positioned at the weighted average of the coordinates of the concepts that occur in the news item. SVMs can be extended to nonlinear classifiers by applying kernel mapping (the kernel trick). As a result of applying the kernel mapping, the original classification problem is transformed into a higher-dimensional space; SVMs which represent linear classifiers in this high-dimensional space may correspond to nonlinear classifiers in the original feature space (Burges, 1998). The kernel function used may influence the performance of the classifier. Zhai et al. (2007) use SVM with a Gaussian RBF kernel and a polynomial kernel.
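The maximum-margin idea behind SVM training can be illustrated with a toy primal solver that minimizes the hinge loss by sub-gradient descent. This is only a sketch of the principle; it is not the optimization used by SVM-light or LIBSVM, and the function names and hyper-parameters are assumptions.

```python
def train_linear_svm(xs, ys, lam=0.001, lr=0.1, epochs=500):
    # Primal linear SVM trained by sub-gradient descent on the hinge
    # loss. Labels must be -1 or +1; lam is the L2 regularization weight.
    dim = len(xs[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Inside the margin or misclassified: push the hyperplane
                # towards separating this point, with L2 shrinkage.
                w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Correctly classified with margin: only regularize.
                w = [wi * (1 - lr * lam) for wi in w]
    return w, b

def svm_predict(w, b, x):
    # Sign of the decision function determines the predicted class.
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1
```

On linearly separable data this converges to a separating hyperplane; packaged solvers additionally handle large sparse feature spaces and kernels efficiently.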

(B) Regression Algorithms: these take on different forms in the research efforts listed in Table 4. One approach is Support Vector Regression (SVR), a regression-based variation of SVM (Drucker, Burges, Kaufman, Smola, & Vapnik, 1997). It is utilized by Hagenau et al. (2013) and Schumaker et al. (2012). Hagenau et al. (2013) use SVM primarily, but they also assess the ability to predict the discrete value of the stock return using SVR: they predict returns and calculate the R2 (squared correlation coefficient) between predicted and actually observed returns. The optimization behind the SVR is very similar to the SVM, but instead of a binary measure (i.e. positive or negative), it is trained on actually observed returns. While a binary measure can only be true or false, this measure gives more weight to greater deviations between actual and predicted returns than to smaller ones. As profits or losses are higher with greater deviations, this measure better captures the actual trading returns to be realized (Hagenau et al., 2013). Schumaker et al. (2012) choose to implement the SVR Sequential Minimal Optimization (Platt, 1999) function through Weka (Witten & Frank, 2005). This function allows discrete numeric prediction instead of classification. They select a linear kernel and ten-fold cross-validation. Sometimes linear regression models are used directly (Chatrath et al., 2014; Jin et al., 2013; Tetlock et al., 2008). Tetlock et al. (2008) use the OLS (Ordinary Least Squares) method for estimating the unknown parameters in the linear regression model. They use two different dependent variables (raw and abnormal next-day returns) regressed on different negative-word measures. Their main result is that negative words in firm-specific news stories robustly predict slightly lower returns on the following trading day. Chatrath et al. (2014) use a stepwise multivariate regression in a Probit (probability unit) model. The purpose of the model is to estimate the probability that an observation with particular characteristics will fall into a specific category; in this case, to ascertain the probability that news releases result in jumps. Jin et al. (2013) apply topic-clustering methods and use customized sentiment dictionaries to uncover sentiment trends by analyzing relevant sentences. A linear regression model estimates the weight for each topic and makes currency forecasts.
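For the single-regressor case, the OLS estimates mentioned above reduce to closed-form expressions, sketched here as a generic illustration (not the multivariate setups of the cited works):

```python
def ols_fit(xs, ys):
    # Ordinary Least Squares for y = alpha + beta * x:
    # beta = cov(x, y) / var(x), alpha = mean(y) - beta * mean(x).
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    beta = cov / var
    return mean_y - beta * mean_x, beta
```

In the cited settings, x would be a text-derived measure (e.g. a negative-word fraction) and y a raw or abnormal next-day return, with further regressors added in the multivariate case.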

(C) Naïve Bayes: this is the next algorithm used in a group of works in Table 4. It is probably the oldest classification algorithm (Lewis, 1998), but it is still very popular and is used in many of the works (Groth & Muntermann, 2011; Li, 2010; Werner & Myrray, 2004; Wuthrich et al., 1998). It is based on the Bayes Theorem and is called naïve because it rests on the naïve assumption of complete independence between text features. It differentiates itself from approaches such as k-Nearest Neighbors (k-NN), Artificial Neural Networks (ANN) or Support Vector Machines (SVM) in that it builds upon probabilities (of a feature belonging to a certain category), whereas the other mentioned approaches interpret the document feature-matrix spatially. Yu, Duan, and Cao (2013c) apply the Naïve Bayes (NB) algorithm to conduct sentiment analysis, to examine the effect of multiple sources of social media along with the effect of conventional media, and to investigate their relative importance and their interrelatedness. Li (2010) uses Naïve Bayes to examine the information content of the forward-looking statements in the Management Discussion and Analysis section of company filings. He uses the Naïve Bayes module in the Perl programming language to conduct the computation.
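A minimal multinomial Naïve Bayes text classifier can be sketched as follows. Laplace smoothing is used here as a common default; the reviewed works do not all specify their smoothing, and the function names are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    # Multinomial Naive Bayes training: per-class word counts plus class
    # priors. "Naive" = word occurrences are assumed mutually independent.
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    vocab = set()
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
        vocab.update(doc)
    return word_counts, class_counts, len(vocab)

def nb_classify(model, doc):
    word_counts, class_counts, vocab_size = model
    total_docs = sum(class_counts.values())
    best_class, best_logp = None, float("-inf")
    for c in class_counts:
        # Log prior plus summed log likelihoods with Laplace smoothing.
        logp = math.log(class_counts[c] / total_docs)
        denom = sum(word_counts[c].values()) + vocab_size
        for w in doc:
            logp += math.log((word_counts[c][w] + 1) / denom)
        if logp > best_logp:
            best_class, best_logp = c, logp
    return best_class
```

Working in log space avoids numeric underflow when documents contain many words.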

Table 4. Classification algorithms and other machine learning aspects.

Reference | Algorithm type | Algorithm details | Training vs. testing volume and sampling | Sliding window | Semantics | Syntax | News and tech. data | Software
Pui Cheong Fung et al. (2003) | SVM | SVM-Light | First 6 consecutive months vs. the last month | No | No | No | No | Not mentioned
Mittermayer (2004) | SVM | SVM-light | 200 vs. 6002 examples | No | No | No | No | NewsCATS
Soni et al. (2007) | SVM | SVM with standard linear kernel | 80% vs. 20% | No | Yes | No | No | LibSVM package
Zhai et al. (2007) | SVM | SVM with Gaussian RBF kernel and polynomial kernel | First 12 months vs. the remaining two months | No | Yes | No | Yes | Not mentioned
Schumaker and Chen (2009) | SVM | - | Not mentioned | No | Yes | Yes | Yes | Arizona Text Extractor (AzTeK) & AZFinText
Lugmayr and Gossen (2012) | SVM | - | Not mentioned | No | Yes | No | Yes | SentiWordNet
Hagenau et al. (2013) | Regression Algorithms | SVM with a linear kernel, SVR | Not mentioned | No | Yes | Yes | Yes | Not mentioned
Schumaker et al. (2012) | Regression Algorithms | SVR | Not mentioned | No | Yes | No | Yes | OpinionFinder
Jin et al. (2013) | Regression Algorithms | Linear regression model | Previous day vs. a given day (2 weeks for regression) | Yes | Yes | No | No | Forex-foreteller, Loughran-McDonald financial dic., AFINN dic.
Chatrath et al. (2014) | Regression Algorithms | Stepwise multivariate regression model | Not applicable | No | No | No | No | Not mentioned
Tetlock et al. (2008) | Regression Algorithms | OLS regression | 30 and 3 trading days prior to an earnings announcement | Yes | Yes | No | No | Harvard-IV-4 psychosocial dictionary
Yu, Duan, et al. (2013) | Naïve Bayes | Naïve Bayes | Not mentioned | No | Yes | No | No | Open-source natural language toolkit (NLTK)
Li (2010) | Naïve Bayes | Naïve Bayes and dictionary-based | 30,000 randomly vs. itself and the rest | No | No | No | No | Diction, General Inquirer, the Linguistic Inquiry and Word Count (LIWC)
Peramunetilleke and Wong (2002) | Decision Rules or Trees | Rule classifier | 22 September 12:00-27 September 9:00 vs. 9:00-10:00 on 27 September | Yes | Yes | No | No | Not mentioned
Huang, Liao, et al. (2010) and Huang, Chuang, et al. (2010) | Decision Rules or Trees | Weighted association rules | 2005 June-2005 October vs. 2005 November | No | Yes | Yes | No | Not mentioned
Rachlin et al. (2007) | Decision Rules or Trees | C4.5 decision tree | Not mentioned | No | No | No | Yes | Extractor software package
Vu et al. (2012) | Decision Rules or Trees | C4.5 decision tree | Trained by previous-day features | Yes | Yes | Yes | No | CRF++ toolkit, Firehose, TST, CMU POS Tagger, AltaVista
Das and Chen (2007) | Combinatory Algorithms | Combination of different classifiers | 1000 vs. the rest | No | Yes | Yes | No | General Inquirer
Mahajan et al. (2008) | Combinatory Algorithms | Stacked classifier | August 05-December 07 vs. January 08-April 08 | No | Yes | No | No | Not mentioned
Butler and Kešelj (2009) | Combinatory Algorithms | CNG distance measure & SVM & combined | Year x vs. years x-1 and x-2, and all vector representations vs. particular testing year | Yes | No | No | Yes | Perl n-gram module Text::Ngrams developed by Kešelj; LIBSVM
Bollen and Huina (2011) | Combinatory Algorithms | Self-organizing fuzzy neural network (SOFNN) | 28 February-28 November vs. 1-19 December 2008 | No | N/A | N/A | No | GPOMS, OpinionFinder

Table 4 (continued)

Reference | Algorithm type | Algorithm details | Training vs. testing volume and sampling | Sliding window | Semantics | Syntax | News and tech. data | Software
Wuthrich et al. (1998) | Multi-algorithm experiments | k-NN, ANNs, naïve Bayes, rule-based | Last 100 training days to forecast 1 day | Yes | Yes | No | No | Not mentioned
Werner and Myrray (2004) | Multi-algorithm experiments | Naïve Bayes, SVM | 1000 messages vs. the rest | No | No | No | No | Rainbow package
Groth and Muntermann (2011) | Multi-algorithm experiments | Naïve Bayes, k-NN, ANN, SVM | Stratified cross-validation | No | No | No | No | Not mentioned

(D) Decision rules and trees: this is the next group of algorithms used in the literature, as indicated in Table 4. A few of the researchers have made an effort to create rule-based classification systems (Huang, Chuang, et al., 2010; Huang, Liao, et al., 2010; Peramunetilleke & Wong, 2002; Vu et al., 2012). Peramunetilleke and Wong (2002) use a set of keywords provided by a financial domain expert. The classifier expressing the correlation between the keywords and one of the outcomes is a rule set. Each of the three rule sets (DOLLAR_UP, DOLLAR_STEADY, DOLLAR_DOWN) yields a probability saying how likely the respective event is to occur in relation to the available keywords. Huang, Liao, et al. (2010) and Huang, Chuang, et al. (2010) observe that the combination of two or more keywords in a financial news headline might play a crucial role on the next trading day. They therefore applied a weighted association-rules algorithm to detect the important compound terms in the news headlines. Rachlin et al. (2007) use a Decision Tree Induction algorithm, which does not assume attribute independence. The algorithm is C4.5, developed by Quinlan (1993). This algorithm yields a set of trend-predicting rules. They further show the effect of the combination between numerical and textual data. Vu et al. (2012) also use the C4.5 decision tree for the binary text-classification problem of predicting the daily up and down changes in stock prices. Since the rules of decision-rule categorizers are composed of words, and words have meaning, the rules themselves can be insightful. More than just attempting to assign a label, a set of decision rules may summarize how to make decisions. For example, the rules may suggest a pattern of words found in newswires prior to the rise of a stock price. The downside of rules is that they can be less predictive if the underlying concept is complex (Weiss, Indurkhya, & Zhang, 2010). Although decision rules can be particularly satisfying solutions for text mining, the procedures for finding them are more complicated than those of other methods (Weiss et al., 2010). Decision trees are special decision rules that are organized into a tree structure. A decision tree divides the document space into non-overlapping regions at its leaves, and predictions are made at each leaf (Weiss et al., 2010).

(E) Combinatory Algorithms: in Table 4, this refers to a class of algorithms composed of a number of machine learning algorithms stacked or grouped together. Das and Chen (2007) have combined multiple classification algorithms by a voting system to extract investor sentiment. The algorithms are, namely, the Naive Classifier, Vector Distance Classifier, Discriminant-Based Classifier, Adjective-Adverb Phrase Classifier and Bayesian Classifier. Accuracy levels turn out to be similar to widely used Bayes classifiers, but false positives are lower and sentiment accuracy higher. Mahajan et al. (2008) identify and characterize major events that impact the market using a Latent Dirichlet Allocation (LDA) based topic-extraction mechanism. Then a stacked classifier is used, which is a trainable classifier that combines the predictions of multiple classifiers via a generalized voting procedure. The voting step is a separate classification problem. They use a decision tree based on information gain for handling numerical attributes, in conjunction with an SVM with a sigmoid kernel, to design the stacked classifier. The average accuracy of the classification system is 60%. Butler and Kešelj (2009) propose two methods and then combine them to achieve the best performance. The first method is based on character n-gram profiles, which are generated for each company annual report and then labeled based on the common n-gram (CNG) classification. The second method combines readability scores with performance inputs and then supplies them to a Support Vector Machine (SVM) for classification. The combined version is set up to only make decisions when the two models agree. Bollen and Huina (2011) deploy a self-organizing fuzzy neural network (SOFNN) model to test the hypothesis that including public-mood measurements can improve the accuracy of Dow Jones Industrial Average (DJIA) prediction models. A fuzzy neural network is a learning machine that finds the parameters of a fuzzy system (i.e. fuzzy sets, fuzzy rules) by exploiting approximation techniques from neural networks. It is therefore classified as a combinatory algorithm in Table 4.

(F) Multi-algorithm experiments: this is another class of works in Table 4, whereby the same experiments are conducted using a number of different algorithms.

Wuthrich et al. (1998) is one of the earliest works of research in this area. They do not stack multiple algorithms together to form a

    with Applications 41 (2014) 76537670 7663bigger algorithm. However, they conduct their experiments usingmultiple algorithms and compare the results.

    Werner and Myrray (2004) also carry out all tests using two algorithms, Naïve Bayes and SVM. Furthermore, Groth and Muntermann (2011) employ Naïve Bayes, k-Nearest Neighbor (k-NN), Artificial Neural Networks (ANN) and SVM in order to detect patterns in the textual data that could explain increased risk exposure in stock markets.

    In conclusion, SVM has been extensively and successfully used as a textual-classification and sentiment-learning approach, while some other approaches like Artificial Neural Networks (ANN) and k-Nearest Neighbors (k-NN) have rarely been considered in the text-mining literature for market prediction. This is also confirmed by Moraes, Valiati, and Gavião Neto (2013). Their research presents an empirical comparison between SVM and ANN regarding document-level sentiment analysis. Their results show that ANN can produce superior, or at least comparable, results to SVM. Such results provide grounds for looking into the usage of algorithms other than the currently dominant SVM. Li, Yang, and Park (2012) demonstrate high performance of k-NN for text categorization, as do Jiang, Pang, Wu, and Kuang (2012). Tan, Wang, and Wu (2011) claim that for document categorization a centroid classifier performs slightly better than an SVM classifier and beats it in running time too. Gradojevic and Gençay (2013) present a rare piece of research that successfully uses fuzzy logic to improve technical trading signals on the EUR-USD exchange rate, but fuzzy logic is rarely used in the reviewed works for market prediction based on text mining. Only Bollen and Huina (2011) use it, in combination with neural networks, to successfully devise a self-organizing fuzzy neural network (SOFNN). However, Loia and Senatore (2014) show that it can be very useful for emotion modeling in general. Exploration of such under-researched algorithms in the context of market prediction may lead to new insights of interest for future researchers.
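To make the contrast concrete, a bare-bones k-NN text classifier of the kind the studies above argue deserves more attention can be sketched in a few lines (the training snippets and labels are invented for illustration):

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words term-frequency vector as a Counter."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(train, text, k=3):
    """train: list of (text, label) pairs; returns the majority label
    among the k training documents most similar to `text`."""
    ranked = sorted(train, key=lambda tl: cosine(bow(tl[0]), bow(text)),
                    reverse=True)
    top = [label for _, label in ranked[:k]]
    return Counter(top).most_common(1)[0][0]

train = [
    ("profits surge on strong demand", "up"),
    ("record quarterly profits reported", "up"),
    ("shares tumble on weak outlook", "down"),
    ("losses widen as demand falls", "down"),
]
print(knn_classify(train, "strong profits reported", k=3))  # → "up"
```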

    In order to better comprehend the reviewed systems in which the machine-learning algorithms have been used, a number of additional system properties have been reviewed in Table 4; these are explained in the following sections.

    3.4.2. Training vs. testing volume and sampling

    In this column of Table 4, two aspects are summarized where the information was available. Firstly, the volume of examples used for training vs. the volume used for testing; around 70 or 80 percent for training vs. 30 or 20 percent for testing seems to be the norm. Secondly, whether the sampling for training and testing was of a special kind; of special interest here is whether linear sampling has been followed, since in essence the samples lie on a time-series. Some of the works have clearly stated the sampling type, such as stratified sampling (Groth & Muntermann, 2011), whereas most of the others, surprisingly, have not mentioned anything at all.
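A linear (chronological) split of the kind discussed above can be sketched as follows (the dates and labels are toy data):

```python
def chronological_split(samples, train_frac=0.8):
    """Split time-ordered samples linearly: the earliest portion for
    training, the most recent for testing. Random shuffling would leak
    future information into training on time-series data."""
    cut = int(len(samples) * train_frac)
    return samples[:cut], samples[cut:]

# Ten days of (date, label) samples, already in time order:
days = [(f"2014-06-{d:02d}", "up" if d % 3 else "down") for d in range(1, 11)]
train, test = chronological_split(days, train_frac=0.8)
print(len(train), len(test))  # 8 2
print(test[0][0])             # 2014-06-09
```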

    3.4.3. Sliding window

    The overall objective of the reviewed systems is to predict the market movement in a future time-window (prediction-window) based on the learning gained in a past time-window (training-window) in which the machine-learning algorithms are trained and patterns are recognized.

    For example, a system may learn patterns based on the available data for multiple days (training-window) in order to predict the market movement on a new day (prediction-window). The length and location of the training-window on the timeline may have two possible formats: fixed or sliding. If the training-window is fixed, the system learns based on the data available from point A to point B on the timeline, and those two points are fixed; for instance, from date A to date B. In such a scenario, the resulting learning from the training-window is applied to the prediction-window regardless of where on the timeline the prediction-window is located: it may be right after the training-window, or it may be farther into the future, at a distance from the training-window. Obviously, if there is a big distance between the training-window and the prediction-window, the learning captured by the machine-learning algorithm may not be up-to-date, and therefore not accurate enough, because the information available in the gap is not used for training.

    Hence, a second format is introduced to solve the above problem, whereby the entire training-window, or one side of it (the side at the end), is capable of dynamically sliding up to the point where the prediction-window starts. In other words, if the training-window starts at point A and ends at point B, and the prediction-window starts at point C and ends at point D, then in the sliding-window format the system ensures that point B is always right before and adjacent to point C. This approach is simply referred to in this work as a sliding window. The reviewed works whose system design does possess a sliding window are identified and marked in Table 4 under a column with the same name.

    Although it intuitively seems necessary to implement a sliding window, very few of the reviewed works actually have one (Butler & Kešelj, 2009; Jin et al., 2013; Peramunetilleke & Wong, 2002; Tetlock et al., 2008; Wuthrich et al., 1998). This seems to be an aspect that can receive more attention in future systems.
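The sliding-window scheme described above can be sketched generically (a toy integer series stands in for dated samples):

```python
def sliding_windows(series, train_len, pred_len=1):
    """Yield (training-window, prediction-window) pairs in which the
    training window always ends immediately before the prediction
    window (point B adjacent to point C), sliding forward one step
    at a time so the learning never goes stale."""
    for start in range(len(series) - train_len - pred_len + 1):
        b = start + train_len
        yield series[start:b], series[b:b + pred_len]

days = list(range(1, 8))  # seven time steps
pairs = list(sliding_windows(days, train_len=4, pred_len=1))
print(pairs[0])   # ([1, 2, 3, 4], [5])
print(pairs[-1])  # ([3, 4, 5, 6], [7])
```

A fixed-window design would instead train once on, say, `series[0:4]` and reuse that model for every later prediction window.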

    3.4.4. Semantics and syntax

    In Natural Language Processing (NLP) two aspects of language are attentively researched: semantics and syntax. Simply put, semantics deals with the meaning of words, and syntax deals with their order and relative positioning or grouping. In this section, firstly, a closer look is cast at the significance of each of them and some of the recent related works of research; secondly, it is reported if and how they have been observed in the reviewed text-mining works for market-prediction.

    Tackling semantics is an important issue, and research efforts are occupied with it on a number of fronts. It is important to develop specialized ontologies for specific contexts like finance; Lupiani-Ruiz et al. (2011) present a financial-news semantic search engine based on Semantic Web technologies. The search engine is accompanied by an ontology population tool that assists in keeping the financial ontology up-to-date. Furthermore, semantics can be included in the design of feature-weighting schemes; Luo, Chen, and Xiong (2011) propose a novel term-weighting scheme by exploiting the semantics of categories and indexing terms. Specifically, the semantics of categories are represented by the senses of terms appearing in the category labels, as well as the interpretation of them by WordNet (Miller, 1995). Also, in their work the weight of a term is correlated to its semantic similarity with a category. WordNet (Miller, 1995) provides a semantic network that links the senses of words to each other. The main types of relations among WordNet synsets are the super-subordinate relations, hyperonymy and hyponymy; other relations are meronymy and holonymy (Loia & Senatore, 2014). It is critical to facilitate semantic relations of terms for getting a satisfactory result in text categorization; Li et al. (2012) show a high performance of text categorization in which semantic relations of terms were sought, drawing upon two kinds of thesauri: a corpus-based thesaurus (CBT) and WordNet (WN). When a combination of CBT and WN was used, they obtained the highest level of performance in the text categorization.
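As a toy sketch of such semantics-aware term weighting, the idea of weighting a term by its sense overlap with a category label can be illustrated with a hand-made stand-in for WordNet (all sense identifiers below are invented, not real WordNet synsets):

```python
# A toy stand-in for a WordNet-style resource: each term maps to a set
# of sense identifiers. A real system would query WordNet synsets.
senses = {
    "finance":  {"money.n.01", "economy.n.01"},
    "profit":   {"money.n.01", "gain.n.01"},
    "earnings": {"money.n.01", "gain.n.01"},
    "football": {"sport.n.01"},
}

def sense_overlap(term, category_label):
    """Jaccard overlap between the senses of a term and those of a
    category label, used here as a crude semantic term weight."""
    a, b = senses.get(term, set()), senses.get(category_label, set())
    return len(a & b) / len(a | b) if a | b else 0.0

print(round(sense_overlap("profit", "finance"), 2))    # 0.33
print(round(sense_overlap("football", "finance"), 2))  # 0.0
```

Terms semantically close to the category thus receive higher weight than terms that merely co-occur with it.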

    Syntax is also very important, and proper observation and utilization of it along with semantics (or sometimes instead of it) can improve textual-classification accuracy. Kim et al. (2014) propose a novel kernel, called the language independent semantic (LIS) kernel, which is able to effectively compute the similarity between short-text documents without using grammatical tags and lexical databases. Experimental results on English and Korean datasets show that the LIS kernel performs better than several existing kernels. This is essentially a syntax-based pattern-extraction method. It is interesting to note that there are several approaches to such syntax-based pattern-recognition methods. In the word occurrence method, a pattern is a word that appears in a document (Joachims, 1998). In the word sequence method, a pattern consists of a set of consecutive words that appear in a document (Lodhi, Saunders, Shawe-Taylor, Cristianini, & Watkins, 2002). In the parse-tree method, which is based on syntactic structure, a pattern is extracted from the tree of a document by considering not only the word occurrence but also the word sequence in the document (Collins & Duffy, 2001).
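The first two of these pattern families can be sketched directly (the headline is a made-up example; the parse-tree method would additionally require a syntactic parser):

```python
def word_occurrence_features(text):
    """Word occurrence method: each distinct word is a pattern."""
    return set(text.lower().split())

def word_sequence_features(text, n=2):
    """Word sequence method: each run of n consecutive words is a pattern."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

headline = "Central bank raises interest rates"
print(sorted(word_occurrence_features(headline)))
# ['bank', 'central', 'interest', 'raises', 'rates']
print(sorted(word_sequence_features(headline, n=2)))
# ['bank raises', 'central bank', 'interest rates', 'raises interest']
```

Note how the sequence features keep "interest rates" together as one pattern, which the occurrence method cannot express.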

    Duric and Song (2012) propose a set of new feature-selection schemes that use a Content and Syntax model to automatically learn a set of features in a review document by separating the entities that are being reviewed from the subjective expressions that describe those entities in terms of polarities. The results obtained from using these features in a maximum-entropy classifier are competitive with the state-of-the-art machine-learning approaches (Duric & Song, 2012). Topic models such as Latent Dirichlet Allocation (LDA) are generative models that allow documents to be explained by unobserved (latent) topics. The Hidden Markov Model LDA (HMM-LDA) (Griffiths, Steyvers, Blei, & Tenenbaum, 2005) is a topic model that simultaneously models topics and syntactic structures in a collection of documents. The idea behind the model is that a typical word can play different roles: it can either be part of the content and serve a semantic (topical) purpose, or it can be used as part of the grammatical (syntactic) structure, or both (Duric & Song, 2012). HMM-LDA models this behavior by inducing syntactic classes for each word based on how they appear together in a sentence, using a Hidden Markov Model. Each word gets assigned to a syntactic class, but one class is reserved for the semantic words. Words in this class behave as they would in a regular LDA topic model, participating in different topics and having certain probabilities of appearing in a document (Duric & Song, 2012).

    In Table 4, there is one column dedicated to each of these aspects, namely Semantics and Syntax. About half of the systems incorporate some semantic aspect into their text-mining approach, usually by using a dictionary or thesaurus and categorizing the words based on their meaning, but none is advanced in the directions pointed out above. Moreover, very few works have made an effort to include syntax, i.e. the order and role of words. These basic and somewhat indirect approaches are noun-phrases (Schumaker & Chen, 2009), word-combinations and n-grams (Hagenau et al., 2013), the simultaneous appearance of words (Huang, Chuang, et al., 2010; Huang, Liao, et al., 2010) and triplets which consist of an adjective or adverb and the two words immediately following or preceding them (Das & Chen, 2007). Some works, like Vu et al. (2012), include Part-of-Speech (POS) tagging as a form of attention to syntax. Loia and Senatore (2014) achieve phrase-level sentiment analysis by taking into account four syntactic categories, namely nouns, verbs, adverbs and adjectives. The need for deep syntactic analysis in phrase-level sentiment-analysis has been investigated by Kanayama and Nasukawa (2008).

    3.4.5. Combining news and technical data or signals

    It is possible to pass technical data or signals along with the text features into the classification algorithm as additional independent variables. Examples of technical data could be a price or index level at a given time. Technical signals are the outputs of technical algorithms or rules, like the moving average, relative-strength rules, filter rules and trading-range breakout rules. Few of the researchers have taken advantage of these additional inputs, as indicated in Table 4 (Butler & Kešelj, 2009; Hagenau et al., 2013; Rachlin et al., 2007; Schumaker & Chen, 2009; Schumaker et al., 2012; Zhai et al., 2007).
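A minimal sketch of appending such technical inputs to a textual feature vector (all numbers are toy data; a real system would use whichever indicators and rules it relies on):

```python
def moving_average(prices, n):
    """Simple moving average over the last n prices (a technical signal)."""
    return sum(prices[-n:]) / n

def combined_feature_vector(text_features, prices, n=3):
    """Append technical inputs (the latest price and a moving-average
    signal) to the textual feature vector as extra independent variables,
    so the classifier sees news and market data side by side."""
    return text_features + [prices[-1], moving_average(prices, n)]

text_features = [1, 0, 1, 1]  # e.g. binary word-occurrence features
prices = [100.0, 102.0, 101.0, 104.0]
vec = combined_feature_vector(text_features, prices, n=3)
print(vec)  # text features followed by latest price and moving average
```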

    3.4.6. Used software

    Lastly, it is interesting to observe some of the common third-party applications used for the implementation of the systems. In this column of Table 4 the reader can see the names of the software pieces used by the reviewed works as part of their pre-processing or machine learning. They are mostly dictionaries (Li, 2010; Tetlock et al., 2008), classification-algorithm implementations (Butler & Kešelj, 2009; Werner & Myrray, 2004), concept-extraction and word-combination packages (Das & Chen, 2007; Rachlin et al., 2007) or sentiment-value providers (Schumaker et al., 2012).

    3.5. Findings of the reviewed works

    In Table 5, the evaluation mechanisms have been looked at, as well as the new findings for each piece of research.

    Most of the works present a confusion matrix, or parts thereof, to report their results, and calculate accuracy, recall or precision and sometimes the F-measure, with accuracy being the most common. The accuracy in the majority of cases is reported in the range of 50–70 percent, while arguing for better-than-chance results, with chance estimated at 50 percent (Butler & Kešelj, 2009; Li, 2010; Mahajan et al., 2008; Schumaker & Chen, 2009; Schumaker et al., 2012; Zhai et al., 2007). It is a common evaluation approach, and results above 55% have been considered report-worthy in other parts of the literature as well (Garcke, Gerstner, & Griebel, 2013). However, what makes most of the results questionable is that the majority of them, surprisingly, have not examined or reported whether their experiment data is imbalanced or not. As this is important in data-mining (Duman, Ekinci, & Tanrıverdi, 2012; Thammasiri, Delen, Meesad, & Kasap, 2014), an additional column has been placed in Table 5 to check for this. Among the reviewed works, only Soni et al. (2007), Mittermayer (2004) and Peramunetilleke and Wong (2002) have paid some attention to this topic. It is also crucial to note that if an imbalanced dataset with imbalanced classes is encountered, especially with a high dimensionality in the feature-space, devising a suitable feature-selection approach that can appropriately deal with both the imbalanced data and the high dimensionality becomes critical. Feature-selection for high-dimensional imbalanced data is treated in detail in the work of Yin, Ge, Xiao, Wang, and Quan (2013). Liu, Loh, and Sun (2009) tackle the problem of imbalanced textual data using a simple probability-based term-weighting scheme to better distinguish documents in minor categories. Smales (2012) examines the relationship between order imbalance and macroeconomic news in the context of the Australian interest-rate futures market and identifies nine major macroeconomic announcements with impact on order imbalance.
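The standard confusion-matrix metrics, together with the class-balance check that the paragraph above argues should always accompany them, can be computed as follows (the labels are toy data):

```python
def confusion_counts(y_true, y_pred, positive="up"):
    """Count true/false positives and negatives for a binary task."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

y_true = ["up", "up", "down", "down", "up", "down"]
y_pred = ["up", "down", "down", "up", "up", "down"]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy  = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f_measure = 2 * precision * recall / (precision + recall)

# Class balance should be reported alongside accuracy: with a 90/10
# class split, 90% accuracy is achievable by always guessing the
# majority class.
balance = y_true.count("up") / len(y_true)
print(accuracy, precision, recall, f_measure, balance)
```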

    Another popular evaluation approach, used in addition to the above by half of the reviewed works, is the assembly of a trading strategy or engine (Groth & Muntermann, 2011; Hagenau et al., 2013; Huang, Chuang, et al., 2010; Huang, Liao, et al., 2010b; Mittermayer, 2004; Pui Cheong Fung et al., 2003; Rachlin et al., 2007; Schumaker & Chen, 2009; Schumaker et al., 2012; Tetlock et al., 2008; Zhai et al., 2007), through which a trading period is simulated and profits are measured to evaluate the viability of the system.

    In general, researchers are using evaluation mechanisms and experimental data that vary widely, and this makes an objective comparison in terms of concrete levels of effectiveness unreachable.

    4. Suggestions for future work

    Market-prediction mechanisms based on online text mining are just emerging to be investigated rigorously, utilizing the recent radical peak of computational processing power and network speed. We foresee this trend continuing. This research helps put into perspective the role of human reactions to events in the making of markets, and can lead to a better understanding of market efficiencies and convergence via information absorption. In summary, this work identifies the areas or aspects below as in need of future research and advancement:

    Table 5
    Findings of the reviewed works, existence of a trading strategy and balanced data.

    Reference | Findings | Trading strategy | Balanced data
    Wuthrich et al. (1998) | Ftse 42%, Nky 47%, Dow 40%, Hsi 53% and Sti 40% | Yes | Not mentioned
    Peramunetilleke and Wong (2002) | Better than chance | No | Yes
    Pui Cheong Fung et al. (2003) | The cumulative profit of monitoring multiple time series is nearly double that of monitoring a single time series | Yes | Not mentioned
    Werner and Myrray (2004) | Evidence that the stock messages help predict market volatility, but not stock returns | No | No
    Mittermayer (2004) | Average profit 11% compared to average profit by random trader 0% | Yes | Yes
    Das and Chen (2007) | Regression has low explanatory power | No | Not mentioned
    Soni et al. (2007) | Hit rate of classifier: 56.2% compared to 47.5% for naïve classifier and 49.1% for SVM bag-of-words | No | Yes (roughly)
    Zhai et al. (2007) | Price 58.8%, direct news 62.5%, indirect news 50.0%, combined news 64.7%, price and news 70.1%. Profit for Price & News 5.1% in 2 months, and for Price and News alone around half of it each | Yes | Not mentioned
    Rachlin et al. (2007) | Cannot improve the predictive accuracy of the numeric analysis. Accuracy 82.4% for joint textual and numeric analysis, 80.6% for textual analysis, and 83.3% numeric alone | Yes | Not mentioned
    Tetlock et al. (2008) | (1) The fraction of negative words in firm-specific news stories forecasts low firm earnings; (2) firms' stock prices briefly under-react to the information embedded in negative words; and (3) the earnings and return predictability from negative words is largest for the stories that focus on fundamentals | Yes | Not relevant
    Mahajan et al. (2008) | Accuracy 60% | No | Not mentioned
    Butler and Kešelj (2009) | First method: 55% and 59% accuracy for character-grams and word-grams respectively, still superior to the benchmark portfolio. Second method: overall accuracy and over-performance precision were 62.81% and 67.80% respectively | No | Not mentioned
    Schumaker and Chen (2009) | Directional accuracy 57.1%, return 2.06%, closeness 0.04261 | Yes | Not mentioned
    Li (2010) | Accuracy for tone 67% and content 63% with naïve Bayes, and less than 50% with dictionary-based | No | No
    Huang, Liao, et al. (2010) and Huang, Chuang, et al. (2010) | Prediction accuracy and recall rate up to 85.2689% and 75.3782% on average, respectively | Yes | Not mentioned
    Groth and Muntermann (2011) | Accuracy (slightly) above the 75% guessing-equivalent benchmark | Yes | No
    Schumaker et al. (2012) | Objective articles performed poorly in directional accuracy vs. baseline; neutral articles had poorer trading returns vs. baseline; subjective articles performed better, with 59.0% directional accuracy and a 3.30% trading return; polarity performed poorly vs. baseline | Yes | Not mentioned
    Lugmayr and Gossen (2012) | In progress | No | Not mentioned
    Yu, Duan, et al. (2013a) | Polarity with 79% accuracy and 0.86 F-measure on the test set. Only the total number of social-media counts has a significant positive relationship with risk, but not with return. The interaction term has a marginally negative relationship with return, but a highly significant negative relationship with risk | No | Not mentioned
    Hagenau et al. (2013) | Feedback-based feature selection combined with 2-word combinations achieved accuracies of up to 76% | Yes | No
    Jin et al. (2013) | Precision around 0.28 on average | No | Not mentioned
    Chatrath et al. (2014) | (a) Jumps are a good proxy for news arrival in currency markets; (b) there is a systematic reaction of currency prices to economic surprises; and (c) prices respond quickly, within 5 min of the news release | No | Not mentioned
    Bollen and Huina (2011) | The mood "Calm" had the highest Granger-causality relation with the DJIA for time lags ranging from two to six days (p-values < 0.05). The other four GPOMS mood dimensions and OpinionFinder did not have a significant correlation with stock-market changes | No | Not mentioned
    Vu et al. (2012) | Combination of previous day's price movement, bullish/bearish and Pos_Neg features creates a superior model in all 4 companies, with accuracies of 82.93%, 80.49%, 75.61% and 75.00%, and for the online test 76.92%, 76.92%, 69.23% and 84.62% | No | Not mentioned

    A. Semantics: as discussed in Section 3.4.4, advancements of techniques in semantics are crucial to the text-classification problem at hand, as text-mining researchers have already shown. However, such advancements have not yet entered the field of market-predictive text-mining. The development of specialized ontologies, by creating new ones or customizing current dictionaries like WordNet, requires more attention. Many of the current works of research are still too focused on word-occurrence methods and rarely even use WordNet. Moreover, semantic relations can be researched with different objectives, from defining weighting schemes for feature-representation to semantic compression or abstraction for feature-reduction. Probably because market-predictive text-mining is itself an emerging field, researchers are yet to dive into semantics enhancement for the market-predictive context specifically.

    B. Syntax: syntactic analysis techniques have received probably even less attention than semantic ones, as also discussed in Section 3.4.4. More advanced syntax-based techniques, like the usage of parse-trees for pattern recognition in text, can improve the quality of text-mining significantly. This aspect requires the attention of future researchers too, starting with attempting to transfer some of the learning from other areas of text-mining, like review classification, into market-predictive text-mining.

    C. Sentiment: sentiment and emotion analysis has gained significant prominence in the field of text-mining due to the interest of governments and multi-national companies in maintaining a finger on the pulse of the public mood, whether to win elections in the case of the former or to surprise customers with the amount of insight about their preferences in the case of the latter. Interestingly, market-prediction is very closely related to the mood of the public or of market-participants, as established by behavioral economics. However, in the case of sentiment analysis with regard to a product, the anticipation of what a piece of text entails is far more straightforward than in the case of market-prediction. There are no secrets as to whether a product review entails positive or negative emotions about it. However, even the best traders and investors can never be completely sure what market reaction to expect as a result of a piece of news-text. Therefore, there is a lot of room for market-predictive sentiment investigation in future research.

    D. Text-mining component, textual-source or application-market specialization: this work has found that the current works of research on market-predictive text-mining are rather holistic, with one-off end-to-end systems. In the future, however, the text-mining process should be broken down into its critical components, like feature-selection, feature-representation and feature-reduction, and each of those needs to be specifically researched for the specialized context of market-prediction; some specific suggestions for each component have been made in Sections 3.3.1–3.3.3 of this work. Furthermore, market-predictive text-mining can also become even more specialized by focusing on a specific source of text, e.g. a specific social-media outlet or news-source, or on a text-role, like news-headlines vs. news-body, etc. Moreover, there is a need for specialized research on each type of financial market (stocks, bonds, commodities, money, futures, derivatives, insurance, forex) or on each geographical location.

    E. Machine Learning Algorithms: in Section 3.4 it has been explained thoroughly how SVM and Naïve Bayes are heavily favored by researchers, probably due to their straightforwardness, while many other machine-learning algorithms or techniques, like Artificial Neural Networks (ANN), k-Nearest Neighbors (k-NN), fuzzy logic, etc., show seriously promising potential for textual classification and sentiment analysis elsewhere in the literature but have not yet been experimented with in the context of market-prediction