


Business Horizons (xxxx) xxx, xxx

Available online at www.sciencedirect.com (ScienceDirect, www.journals.elsevier.com/business-horizons)

In bot we trust: A new methodology of chatbot performance measures

Aleksandra Przegalinska a,*,^, Leon Ciechanowski a,b,^, Anna Stroz c, Peter Gloor d, Grzegorz Mazurek a

a Kozminski University, Jagiellonska 57/59, Warsaw, Poland
b University of Social Sciences & Humanities, Chodakowska 19/31, Warsaw, Poland
c University of Warsaw, Krakowskie Przedmieście 26/28, Warsaw, Poland
d MIT Center for Collective Intelligence, 245 First Street, E94-1509, Cambridge, MA, U.S.A.

KEYWORDS: Artificial intelligence; Chatbots; Chatbot performance; Human-computer interaction; Performance goals; Customer trust; Customer experience

* Corresponding author. E-mail address: aprzegalinska@kozminski.edu.pl (A. Przegalinska)
^ A. Przegalinska and L. Ciechanowski contributed to this article equally.

https://doi.org/10.1016/j.bushor.2019.08.005
0007-6813/© 2019 Kelley School of Business, Indiana University. Published by Elsevier Inc. All rights reserved.

Abstract: Chatbots are used frequently in business to facilitate various processes, particularly those related to customer service and personalization. In this article, we propose novel methods of tracking human-chatbot interactions and measuring chatbot performance that take into consideration ethical concerns, particularly trust. Our proposed methodology links neuroscientific methods, text mining, and machine learning. We argue that trust is the focal point of successful human-chatbot interaction and assess how trust as a relevant category is being redefined with the advent of deep learning supported chatbots. We propose a novel method of analyzing the content of messages produced in human-chatbot interactions, using the Condor Tribefinder system we developed for text mining that is based on a machine learning classification engine. Our results will help build better social bots for interaction in business or commercial environments.
© 2019 Kelley School of Business, Indiana University. Published by Elsevier Inc. All rights reserved.

1. Human-machine interaction today

Artificial intelligence is an established and internally diverse academic discipline spanning many decades. It embraces such subfields as natural language processing, machine learning, robotics, and many others. With the major advancements in machine learning and the arrival of big data, it has become a discipline of utmost importance for the global economy and society as a whole, as well as for rapidly changing businesses and organizations. Current machine learning capabilities include competing at the highest level in strategic



games, autonomously operated vehicles, intelligent routing in content delivery networks, and, most importantly in the context of this article, understanding and generating natural speech. As we look at how far artificial intelligence is reaching, ask this question: How far can and should we reach with further developments in the responsiveness of chatbot systems?

Several real-world cases have shown the tremendous power that AI represents for companies. Facebook uses AI for facial recognition, making its applications better than humans at determining whether two pictures are of the same person. Motorbike manufacturer Harley-Davidson increased New York sales leads almost 30-fold, from fewer than five to around 40 per day, through AI-enabled hyper-targeted and customized communications. These examples show that artificial intelligence, in addition to being a hot topic and a buzzword, has made its way into a variety of organizations and has become top-of-mind for many corporations and institutions. However, many issues are still unsolved. The more AI enters the world of business, the more technological, societal, and ethical questions arise.

In this article, we focus on the important chatbot subdiscipline within AI. Chatbots are interactive virtual agents that engage in verbal interactions with humans. They are an interesting case in human-machine interaction because they are designed to interact with users through natural language based on formal models. This 60-year-old technology was initially designed to determine whether chatbot systems could fool users into believing that they were real humans. Quite early, it was confirmed that, when it comes to simulating interactions, fooling humans was possible (Khan & Das, 2018; Sloman, 2014). Recently, however, the main goal of building chatbot systems has been not only to mimic human conversation and deceive users in various iterations of the Turing test (Aron, 2015; French, 2012; Hernandez-Orallo, 2000; Moor, 2003) but also to apply well-trained chatbots widely to interact with users in business, education, or information retrieval. For quite a long time, applying chatbots was cumbersome and risky (Kluwer, 2011; Shawar & Atwell, 2007a). Currently, however, chatbots can facilitate various business processes, particularly those related to customer service and personalization, because of their accessibility, fairly low cost, and ease of use for end consumers. Our main goal in this article is to track the evolution of chatbots from simple syntactic systems to robust natural language processing entities that use deep

learning (e.g., Google Duplex, Microsoft Cortana, GPT-2) and to expose how this change affects businesses and organizations.

We propose a novel methodology of tracking human-chatbot interactions and measuring chatbot performance that takes into consideration ethical concerns, in particular, trust toward chatbots. Our research links neuroscientific methods, text mining, and deep learning with the issue of trust allocated in chatbots and their overall transparency. On a more general level, we address a wide range of issues of both theoretical and practical importance, including the general effects of AI on companies and institutions, how AI should be used in order to create value, the potential negative effects of AI and how to overcome them, and ethical questions that emerged from the arrival of artificial intelligence.

2. Chatbots in organizations

One of the very first chatbots, ELIZA, was created in 1964-1966 at the MIT Artificial Intelligence Laboratory and was one of the first attempts to simulate human conversation. ELIZA (Weizenbaum, 1966; Weizenbaum & McCarthy, 1976) simulated conversations by using both a substitution approach and pattern matching that beguiled users with an illusion of understanding and reciprocity. Interaction directives hidden in scripts allowed ELIZA to process user logs and engage in communication. ELIZA's best-known script was DOCTOR, which simulated a Rogerian psychotherapist, representing an approach that empowers the patient in the therapeutic process. ELIZA creator Joseph Weizenbaum regarded the program as a method to expose the superficiality of human-machine communication but was surprised by the number of patients who attributed human-like feelings to ELIZA and wanted to continue their therapy with it. ELIZA's success was all the more surprising because the chatbot was incapable of understanding anything. While ELIZA could engage in discourse perceived by the interlocutor as fruitful, it could not converse on any deeper, semantic level, which led to a new idea for developing chatbots. Namely, many chatbot developers felt encouraged to use diverse types of tricks to delude chatbot users. Instead of focusing on cumbersome work with natural language processing, they focused on the easiest possible ways to fool users. Such attempts were often successful, as in the cases of PARRY (Curry & O'Shea, 2012; Phrasee, 2016) and




Eugene Goostman (Aamoth, 2014; Shah, Warwick, Vallverdu, & Wu, 2016). Whereas PARRY effectively simulated a paranoid person by iterating threats and suspicions in every conversation, Eugene Goostman simulated a 14-year-old boy from Ukraine with limited knowledge of the world and a lower competence in speaking English.

We think of these scripts and default responses as the simulation paradigm. It evokes the Chinese room argument. However, with the rise of deep natural language processing capacities, this may soon change as more 'aware' systems will be created. Even before the arrival of graphical user interfaces, the 1970s and 1980s saw rapid growth in text and natural language interface research. Since then, a range of new chatbot architectures appeared, including MegaHAL, Converse, and A.L.I.C.E. (Shawar & Atwell, 2005, 2007b), which stores knowledge about English conversation patterns in artificial intelligence markup language (AIML) files.

The choice of chatbots is now much broader and includes a great deal of machine learning-supported consumer technologies like Siri and Cortana (Dunn, 2016; Hachman, 2014; Lopez, Quesada, & Guerrero, 2017). These may assume the shape of virtual agents or physical objects, which we can also research from the perspective of the proxemic relations they maintain with users, as well as gestural communication (Ciechanowski, Przegalinska, Magnuski, & Gloor, 2019).

Amazon launched the Alexa project (Hayes, 2017) as a platform for a better jukebox, but the company developed it into an artificial intelligence system built upon, and continuously learning from, human data. The Alexa-powered cylinder Echo and the smaller Dot are omnipresent household helpers that can turn off the lights, tell jokes, or let users consume the news hands-free. They also collect reams of data about their users, which Amazon utilizes to improve Alexa and develop its possibilities (Hayes, 2017). Alexa maintains 15,000 skills, but 14,997 of them remain untapped as most users are not aware of them (Przegalinska, 2018). Even though social robots like Alexa have progressed beyond ELIZA as far as usages and capacities in real-life, noisy environments, they are still embedded in the simulation paradigm. For this reason, both Google and OpenAI, with their respective deep learning supported bots Duplex and GPT-2, offer Amazon hefty competition: even though Google boasts only 378 skills for the Google Assistant (Leviathan & Matias, 2018), it has been training its AI with petabytes of data.

3. Current chatbot performance measurements

When it comes to performance measurements, there are certain standards, even though the usage of metrics differs by industry (Botanalytics, 2018). Surveys of companies deploying chatbots (Asher, 2017; Castro et al., 2018) showed how companies across industries have been using chatbot analytics and which parameters they consider crucial in measuring performance. For instance, when it comes to the banking and fintech sectors, chatbots are used mainly to assist users in performing tasks quicker, while lowering call volumes and cutting costs on the service (Okuda & Shoda, 2018). One of the most important chatbot performance metrics is conversation length and structure. Most of the industry trends suggest aiming for shorter conversations with simple structure because banking chatbots intend to provide fast solutions (e.g., sending/receiving money, checking a balance).

Retention rate is yet another important metric for measuring chatbot success. If a company's chatbot aims to replace other communication channels (e.g., lowering call volume), the goal is to obtain significantly higher retention, which is an indicator of consumer satisfaction. However, there are plenty of other automated options that allow users to manage accounts easily without speaking to a human. Thus, if the chatbot is focused more on customer support, a high retention rate does not necessarily have to be the measure of success.
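As an illustration of the two metrics discussed above (ours, not part of the original study), the sketch below computes average conversation length and a simple retention rate from a raw message log; the log format and field names are hypothetical.

```python
# Illustrative sketch (not from the study): computing conversation length and
# retention rate from a hypothetical chatbot message log.
from collections import defaultdict

# Each entry: (user_id, session_id, sender), where sender is "user" or "bot".
log = [
    ("u1", "s1", "user"), ("u1", "s1", "bot"), ("u1", "s1", "user"), ("u1", "s1", "bot"),
    ("u1", "s2", "user"), ("u1", "s2", "bot"),
    ("u2", "s3", "user"), ("u2", "s3", "bot"),
]

# Conversation length: messages exchanged per session (banking bots aim for
# short conversations, retail bots for longer ones, as discussed in the text).
messages_per_session = defaultdict(int)
for user_id, session_id, sender in log:
    messages_per_session[session_id] += 1
avg_conversation_length = sum(messages_per_session.values()) / len(messages_per_session)

# Retention rate: share of users who come back for at least one further session.
sessions_by_user = defaultdict(set)
for user_id, session_id, sender in log:
    sessions_by_user[user_id].add(session_id)
retention_rate = sum(1 for s in sessions_by_user.values() if len(s) > 1) / len(sessions_by_user)

print(f"average conversation length: {avg_conversation_length:.1f} messages per session")
print(f"retention rate: {retention_rate:.0%}")
```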

Another measurement of chatbot performance is the ability to provide personalized communication. A financial chatbot should retain and analyze user information and subsequently provide users with relevant offers. For instance, Erica, a chatbot created for Bank of America, provides users with tips on how they can better save money depending on their spending habits.1

1 https://promo.bankofamerica.com/erica/

Retail chatbots provide gamified ways to shop. Through a conversational interface, brands transform themselves into personal shopping assistants. Unlike banking chatbots, retail chatbot creators usually look for a high number of conversation steps in chatbot engagement metrics. While users might seek a fast solution, retail chatbots are supposed to capture and hold users' attention by encouraging them to browse, providing responses about products, or cross-selling and upselling a purchase. An interesting example here is eBay's ShopBot, which



offers several options to encourage browsing and purchasing.

4. Trust in the organizational context

Trust is generally defined as a firm belief in the reliability, truth, or ability of someone or something, or "an arrangement whereby a person (a trustee) holds property as its nominal owner for the good of one or more beneficiaries" (HOMS Solicitors, 2019). Trust among humans may be perceived in terms of credibility and confidence in one another's judgment, as well as predictability of one's behavior (Cassell & Bickmore, 2000).

In the organizational context, trust can be defined in multiple ways that possess a common ground of belief (Dunn & Schweitzer, 2005). It can be understood as:

• Competence, defined as "the belief that an organization has the ability to do what it says it will do...[including] the extent to which we see an organization as being effective; that it can compete and survive in the marketplace" (Institute for Public Relations, 2003);

• Integrity, defined as the belief that an organization is fair and just; or

• Reliability, defined as the belief that an organization "will do what it says it will do, that it acts consistently and dependably" (Grunig, 2002).

Although the experts are not in complete agreement, trust between an organization and its public is generally described as having various independently quantifiable characteristics including, but not limited to:

• Multilevel aspect: trust results from interactions within and between organizations;

• Communication and discourse-based aspect: trust emerges from interactions and communicative behaviors;

• Culture-driven aspect: trust is correlated with values, norms, and beliefs; and

• Temporal aspect: trust is constantly changing as it cycles through phases of establishing, breaking, discontinuing, etc. (Hurley, 2011; Shockley-Zalabak, Morreale, & Hackman, 2010).

In terms of linking trust and new AI-based technologies in an organizational context, Kaplan and Haenlein (2019, p. 22) cited three common traits relevant "both internally and externally: confidence, change, and control - the three Cs of the organizational implications of AI." An important category we will elaborate on later in this article is external confidence, defined here as confidence in the abilities and recommendations of an organization's AI systems.

5. Trust in chatbots

The issue of trust in bot performance is important for several reasons. In particular, trust should be considered during the implementation of chatbots in environments such as finance, healthcare, and other fields handling sensitive data, in which users may be exposed to physical, financial, or psychological harm (Bickmore & Cassell, 2001). Trust may be a crucial factor because users do not want to share personal information if they are unsure about its security (Chung, Joung, & Ko, 2017b).

A crucial part of trust is related to anthropomorphization (Ciechanowski et al., 2019; Lotze, 2016; Radziwill & Benton, 2017; Stojnić, 2015). Our intuition may lead us to claim that because chatbots interact by conversation, it is appropriate to anthropomorphize them. There are, however, two opposing positions. One of them claims that if an agent is more humanlike, then it is more probable that a sustainable trust relationship between agent and user will be formed; this is known as the human-human perspective. Since humans place social expectations on computers, anthropomorphization increases users' trust in computer agents (i.e., the more human-like the computer is, the more trust humans place in it). The second point of view is that humans place more trust in computerized systems than in other humans (Seeger & Heinzl, 2018). According to this view, high-quality automation leads to a more fruitful interaction because the machine seems to be more objective and rational than a human. Humans tend to trust computer systems more than other humans because humans are expected to be imperfect while the opposite is true for automation (Dijkstra, Liebrand, & Timminga, 1998; Seeger & Heinzl, 2018).

It is important to stress, however, that the process of anthropomorphization is not only about the attribution of superficial human characteristics but, most importantly, about this essential one: a humanlike mind. People trust anthropomorphized




technology more than the mindless one, particularly because of the attribution of competence (Waytz, Heafner, & Epley, 2014), resulting in a significant increase of trust and of the overall performance of cooperation (de Visser et al., 2012; Waytz et al., 2014). The reliability of various elements of the digital economy ecosystem (e.g., wearable devices, systems and platforms, virtual assistants) arises as an important problem when one considers the level of trust allocated to them. In a social context, trust has several connotations. Trust is characterized by the following aspects: one party (the trustor) is willing to rely on the actions of another party (the trustee), and the situation is directed to the future. In addition, the trustor (voluntarily or forcedly) abandons control over the actions performed by the trustee. As a consequence, the trustor is uncertain about the outcome of the other's actions; they can only develop and evaluate expectations. Thus, trust generally can be attributed to relationships between people. It can be demonstrated that humans have a natural disposition to trust and to judge trustworthiness that can be traced to the neurobiological structure and activity of the human brain. When it comes to the relationship between people and technology, the attribution of trust is a matter of dispute. The intentional stance suggests that trust can be validly attributed to human relationships with complex technologies, and machine-learning-based trackers and sensors can be considered such complex technologies (Taddeo, 2010). This is specifically true for information technology that dramatically alters causation in social systems: AI, wearable tech, bots, virtual assistants, and data. All of this requires new definitions of trust.

In terms of trust in chatbots nowadays, one can distinguish at least two other relevant dimensions in addition to anthropomorphization (Nordheim, 2018):

• Ability/expertise: Performance measurements (e.g., customer retention, conversation length, lead generation). Expertise is seen as a factor associated with credibility, a cue for trustworthiness. In the context of an automated system, trust has been argued to be mainly based on users' perceptions of the systems' expertise. Here, the chatbot's expertise is assumed to impact users' trust.

• Privacy/safety: "Security diagnostics expose vulnerabilities and privacy threats that exist in commercial Intelligent Virtual Assistants (IVA) - diagnostics offer the possibility of securer IVA ecosystems" (Chung, Iorga, Voas, & Lee, 2017a, p. 100).

In the past, one could argue, the already mentioned Turing test was a specific form of trust-related practice. It was an experiment conceived as a way of determining the machine's ability to use natural language and indirectly prove its ability to think in a way similar to a human; as such, it touched upon two dimensions of trust: anthropomorphism and expertise. In the case of the latter, trust was related to human expertise to properly assess the status of the interlocutor (human/nonhuman). Today, however, even transparency becomes a problematic topic. As mentioned before, in specific, narrowly specialized topics Google Duplex exhibits a close-to-human level of speech understanding and synthesis, as well as the ability to use nonverbal cues, so the Turing test in its previous forms would no longer be adequate for testing and analyzing it. More importantly, in the samples of conversation provided by Google, Duplex did not reveal that it was a bot and not a human being (Harwell, 2018). This is not to say that the Turing test should be suspended as a tool, but it needs to be reformulated. What is more, DeepMind, a British AI company acquired by Google in 2014, is developing a project that may reshape the chatbot industry in an even more dramatic way.

DeepMind explores the field of the theory of mind (Premack & Woodruff, 1978), which broadly refers to humans' ability and capacity to recognize mental states as mental states belonging to other human beings. DeepMind proposed to train a machine to build such models and attempts to design a Theory of Mind2 neural network (ToM-net), which uses metalearning to build models of the agents it encounters from observations of their behavior alone (Rabinowitz et al., 2018): "Through this process, it acquires a strong prior model for agents' behavior, as well as the ability to bootstrap to richer predictions about agents' characteristics and mental states using only a small number of behavioral observations."

2 DeepMind applies the ToM-net to agents behaving in simple gridworld environments, showing that it learns to model random, algorithmic, and deep reinforcement learning agents from varied populations and, more interestingly, passes classic ToM tasks such as the Sally-Anne test (Baron, 1985; Wimmer & Perner, 1983) of recognizing that others can hold false beliefs about the world. The authors argue that this system is an important step forward for developing multi-agent AI systems, for building intermediating technology for machine-human interaction, and for advancing the progress on interpretable AI.


Figure 1. Two types of the chatbots. Note: Each user was randomly assigned to talk with one of them. On the left is the simple text chatbot; on the right is the avatar with sound.


What this implies in the context of our research, and more broadly, is that human-chatbot interaction is a new method of conceptualizing and researching trust. As we already discussed, in the case of non-black-box systems (e.g., chatbots based on stochastic engines), attributing and researching trust was already difficult. With deep learning systems, new dimensions of trust-based human-machine relations open up.


6. Methodology

We conducted a study (the first of its kind) of human-chatbot online interaction (Ciechanowski et al., 2019; Ciechanowski, Przegalinska, & Wegner, 2018) using both subjective (questionnaires) and objective (psychophysiology) measures. The study consisted of interactions with one of two chatbot types: a simple text chatbot or an avatar with sound (Figure 1).

During the conversation with the chatbot, we gathered psychophysiological data from the participants, and at the end of the procedure they were asked to fill out a couple of questionnaires (Fiske, Cuddy, Glick, & Xu, 2002; Fong, Nourbakhsh, & Dautenhahn, 2002; Pochwatko et al., 2015) that took into account cooperation between humans and chatbots, the realness of the chatbot, and its human likeness. The results important for the current study indicated that users were more satisfied and happier during the interaction with the simple text chatbot, which was confirmed in both the questionnaire results and the psychophysiological markers. These results also confirmed the uncanny valley hypothesis (Mori, MacDorman, & Kageki, 2012), which states that users and consumers are less satisfied when interacting with artificial objects ineptly imitating a human, and this is what our avatar chatbot was attempting to do. This result was also manifested directly in the uncanny valley measure that we applied in our experimental procedure.

However, one research question still interested us: the linguistic correlates of the uncanny valley effect. In other words, perhaps the inept imitation by the avatar chatbot led the users to behave verbally in a specific way toward it, and the chatbot therefore responded in different ways to them, irritating them more than the simple text chatbot did. In order to answer this question, we analyzed the messages users produced in their interaction with the avatar chatbot.3

3 There are no tools for automatic text analysis or text mining in the Polish language, so we first needed to translate the messages into English. For the translation of files with chatbot logs (users' prompts and the chatbot's responses) from Polish to English, the DeepL platform was used. DeepL Translator applies neural machine translation, resulting in impressive accuracy and quality of the text. Parts of sentences were translated manually, particularly because of some mental shortcuts and colloquial phrases introduced by users. The total number of translated entries was 1,373 for 13 users.
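The study itself used the DeepL web platform plus manual corrections, as noted above. Purely as an illustration of how such a batch translation step could be scripted today, the sketch below uses DeepL's official Python client; the file names and API key are placeholders, not details from the study.

```python
# Sketch only: batch-translating Polish chatbot logs to English with DeepL's
# official Python client (pip install deepl). The study used the DeepL web
# platform with manual corrections; the file names and API key below are
# placeholders, not details taken from the article.
import deepl

translator = deepl.Translator("YOUR_DEEPL_API_KEY")  # placeholder key

# Hypothetical input file: one log entry (user prompt or chatbot response) per line.
with open("avatar_chatbot_logs_pl.txt", encoding="utf-8") as f:
    polish_lines = [line.strip() for line in f if line.strip()]

# translate_text accepts a list of strings and returns one result per line.
results = translator.translate_text(polish_lines, source_lang="PL", target_lang="EN-US")

with open("avatar_chatbot_logs_en.txt", "w", encoding="utf-8") as f:
    for result in results:
        f.write(result.text + "\n")
```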

For the analysis of the translated messages, we used Tribefinder, a novel machine-learning-based system able to reveal Twitter users' tribal affiliations (De Oliveira & Gloor, 2018; Gloor, Colladon, de Oliveira, & Rovelli, 2018). In this case, a tribe is "a network of heterogeneous persons...who are linked by a shared passion or emotion" (Cova & Cova, 2002, p. 602). In the market domain, users


Figure 2. Mean line of user-chatbot exchange and number of messages for each user


or consumers affiliate with different tribes and thereby can be characterized by various behaviors, rituals, traditions, values, beliefs, hierarchy, and vocabulary. The analysis and successful use of tribalism may be a crucial factor in modern marketing and management (Cova & Cova, 2002; Moutinho, Dionísio, & Leal, 2007).

Initially, the Tribecreator was used to select tribal influencers and leaders, and Tribefinder was created to establish tribes to which individuals belong through the analysis of their tweets and the comparison of vocabulary. Technically, Tribefinder is backed up by a machine learning system, which was trained with Twitter feeds of 1,000 journalists from The New York Times, The Washington Post, The Guardian, and with members of the U.S. Congress. We applied this system, in our case, to analyze the content of users' and chatbots' messages. Since we were interested in the reason for the users being less willing to cooperate with the avatar chatbot, we analyzed the interactions in this group using the Tribefinder.

The system was able to process messages from 13 out of 15 persons from the avatar chatbot group since it is based on machine learning algorithms and needed sufficient amounts of data in order to carry out a meaningful classification. The mean number of interactions (defined by the user-chatbot exchange of one message) of the selected users was 60.3 (Figure 2).
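Tribefinder itself is part of the Condor toolset and its engine is not reproduced here. Purely to illustrate the general idea (assigning a tribe label to a user from the vocabulary of their messages), the sketch below trains a simple scikit-learn text classifier on invented example messages. It is an illustrative stand-in, not the classifier used in the study.

```python
# Illustrative stand-in, not the Condor Tribefinder engine: a minimal text
# classifier that assigns a lifestyle "tribe" to each user from the vocabulary
# of their messages. Training texts, labels, and user messages are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "no meat no dairy just plants and tofu today",
    "skipped the gym again, staying on the couch all weekend",
    "booked a last minute trip, you only live once",
    "plant based recipes are the best, vegan for life",
    "another lazy evening in front of the tv",
    "quit the job and bought a one way ticket, no regrets",
]
train_labels = ["vegan", "sedentary", "yolo", "vegan", "sedentary", "yolo"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)

# Classify each participant by concatenating all of their translated messages,
# mirroring the idea of allocating a user to a tribe based on shared vocabulary.
user_messages = {
    "user_03": ["I do not eat meat", "do you like tofu?"],
    "user_16": ["I will just stay home tonight", "too tired to go out"],
}
for user, messages in user_messages.items():
    tribe = model.predict([" ".join(messages)])[0]
    print(user, "->", tribe)
```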

The Tribefinder system was able to allocate users into five tribe macro-categories (see Table 1):

• Personality;

• Ideology;

• Lifestyles;

• Alternative reality; and

• Recreation.

The ideology category describes proponents of the main political ideologies; it was trained with 100 human representatives of each of the three main ideologies: socialism, capitalism, and liberalism. The creators of Tribefinder had to add the category of complainers to this list, as some people did not fit into any mainstream ideology (Gloor et al., 2018). Examples of messages on the basis of which the Condor system placed bots and people into different categories can be found in the Appendix.

7. Results

After the analysis of messages generated by users and by the chatbot in the interaction with each user, we observed that all subjects and chatbots (except in two instances) were of the journalist personality type; therefore, they did not use deceitful language. Similarly, all persons and chatbots were complainers (except one person and one chatbot classified as liberal), and thus it was impossible to discern their ideology.

We observed that almost 50% of the users can be characterized as displaying characteristics of the vegan tribe, 33% as YOLO, and 25% as sedentary (Figure 3). We found that every instance of a chatbot interacting with these users was assigned by the Tribefinder to the sedentary tribe. The chatbot used in our experiment is stationary and nonrobotic; therefore, this affiliation is correct.


Table 1. Definitions of tribes assigned to users and chatbots by the Tribefinder system

Personality
• Journalist: People stating the truth.
• Politician: People trying to lie or using deceitful language.

Ideology
• Liberalism: People subscribing to the liberal ideology.
• Capitalism: People subscribing to the capitalist ideology.
• Socialism: People subscribing to the socialist ideology.
• Complainers: People not expressing any ideas related to the 3 main political ideologies.

Lifestyle
• Fitness: People fond of doing sports, even to the point of being compulsive.
• Sedentary: The opposite of the fitness category; these people apply an inactive lifestyle.
• Vegan: People who avoid animal products in their diet.
• YOLO: People living according to the motto You Only Live Once. Can display hedonistic, impulsive, or reckless behavior.

Alternative reality
• Fatherlander (Nationalist): Conservatists, extreme patriots, family people, and xenophobes.
• Nerd (Technocrats): Technology enthusiasts, transhumanists, fans of globalization and networking.
• Spiritualist: People believing in some sort of sacred power and meaning. Keen on spiritual contemplation.
• Treehugger (Environmentalists): People stressing the importance of the protection of nature. Therefore, they accept some technologies (e.g., alternative energies) but challenge others (e.g., gene manipulation).

Recreation
• Art: People interested in art, cherishing its beauty and emotions.
• Fashion: People following fashion (in the domain of clothing, hairstyle, or behavior).
• Sport: People following sport events on media. Some of them also actively practice sports.
• Travel: People loving to travel, getting to know different cultures and places.

Source: Gloor et al. (2018)



When it comes to the alternative realities, we observed an interesting picture. Looking at the users, only two categories emerge: spiritualist (69%) and nerd (31%). However, when we looked at the tribes the chatbots affiliated with in this category, we observed all four tribes represented (Figure 4), with the treehugger and nerd tribes standing out. Interestingly enough, the opposite situation takes place with the recreation category (Figure 5).

These results indicate that users can be mostly described as using spiritual language in conversations with chatbots. This becomes more understandable when we combine it with the fact that the users knew that they were having a conversation with an artificial being and were provoking the chatbot to say something human. Moreover, the users agreeing to participate in our experiment were likely to be technology enthusiasts, which is why the other alternative reality tribal category they displayed was nerd. The abovementioned differences between humans and bots are more visible when presented on a stacked column chart (see Figure 6).

Figure 6 renders the similarities and differences between human and bot messages. Evidently, the bot adjusted its messages to humans in the categories of personality and ideology, while there were significant differences in the categories of lifestyle, alternative realities, and recreation.


Figure 3. Lifestyles of users


These results are understandable, as the chatbot has been following humans in the most abstract and general categories, like using mostly truthful language (journalist type), and, just like the human participants, the bot manifested no political beliefs (complainer type). This, however, may have serious consequences. Let us imagine a chatbot that represents a sports company to clients. If the chatbot mirrors the statements of a user who hates sports, it will say things that go against the company's mission, vision, and values. Such a chatbot will most definitely not be an interface that does its job well.
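The kind of side-by-side comparison behind Figure 6 can be reproduced with a few lines of code once each message carries a tribe label. The sketch below is our illustration with invented counts: it computes the tribe distribution for human and bot messages within one macro-category and flags large gaps, which is exactly where a mirroring chatbot could drift away from its company's intended profile.

```python
# Sketch of the comparison behind Figure 6, with invented counts: tribe
# distributions of human and bot messages in one macro-category, with large
# gaps flagged as potential mismatches.
from collections import Counter

human_tribes = ["spiritualist"] * 9 + ["nerd"] * 4
bot_tribes = ["treehugger"] * 5 + ["nerd"] * 5 + ["spiritualist"] * 2 + ["fatherlander"] * 1

def distribution(labels):
    counts = Counter(labels)
    total = sum(counts.values())
    return {tribe: count / total for tribe, count in counts.items()}

human_dist, bot_dist = distribution(human_tribes), distribution(bot_tribes)
for tribe in sorted(set(human_dist) | set(bot_dist)):
    h, b = human_dist.get(tribe, 0.0), bot_dist.get(tribe, 0.0)
    flag = "  <-- large gap" if abs(h - b) > 0.25 else ""
    print(f"{tribe:13s} humans {h:5.0%}  bot {b:5.0%}{flag}")
```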

8. Essentials in chatbot measurements for emerging chatbots

Looking at the Gartner (2018) Hype Cycle, technologies that have not lived up to their hype end up in the "obsolete before plateau" category. Given the significant gap between Alexa skill development and use, Alexa and many other chatbots may be classified in this category. Several chatbots and social robots may turn out to be useful only as an interface to a search engine: as the thing that asks follow-up questions to refine the user's search to find exactly what the user is looking for.

Figure 4. Alternative realities of users and chatbots


On the other hand, developers believe that voice-based AI devices are not just jukeboxes that iterate the simulation paradigm with better technologies. In fact, chatbots are the perfect example of the implementation of state-of-the-art, consumer-oriented artificial intelligence that not only simulates human behavior based on formal models but also adapts to it. As such, chatbots remain a fascinating subject for the research of patterns in human and nonhuman interaction, along with issues related to assigning social roles to others, finding patterns of (un)successful interactions, and establishing social relationships and bonds. In that sense, chatbots are gradually becoming robust, context-aware systems. Let us not forget that scientists and engineers initially created chatbots because people wanted to use natural language to communicate with computer systems smoothly. Zadrozny et al. (2000, pp. 116-117) argued that the best way to facilitate human-computer interaction is by allowing users "to express their interests, wishes, or queries directly and naturally, by speaking, typing, and pointing." Morrissey and Kirakowski (2013) made a similar point in their criteria for the development of a more human-like chatbot. Perhaps chatbots should become more sensitive to human needs by searching for and delivering information that people need and would otherwise be unable to obtain (Kacprzyk & Zadrozny, 2010; Morrissey & Kirakowski, 2013).

Taking into account that scientists and researchers are developing bots and chatbots so that they become more refined, nuanced, context-aware, and transparent collaboration assistants, there are a few relevant changes related to redefined dimensions of trust.


Figure 5. Recreation of users and chatbots


Namely, on the basis of our research and the major developments in botics, there are new dimensions of trust we would like to suggest:

• Transparency and honesty: The agent sends honest signals in communication, does not speak like a politician (according to Tribefinder categories), and does not deny its status.

• Predictability: Already mentioned in publications concerning trust and chatbots (Nordheim, 2018), predictability could be supplemented by integrity, seen as a factor associated with credibility, and concerns the trustor's expectation that an object of trust will act consistently in line with past experience. If the user perceives the chatbot as predictable, this may lead to a feeling of trust in the chatbot.

• Control and benevolence: These describe the degree to which the motivations and intents of the trustee are in line with those of the trustor.

This last element is strongly linked with the challenge of unblackboxing AI and introducing greater explainability of the information processing performed by AI (Petkovic, Kobzik, & Re, 2018; Preece, 2018). Integrity and benevolence could be further researched with Tribefinder and similar tools, as well as applied in business chatbots.

Figure 6. Human and chatbot tribe types. Note: Presented in separate tribal macro-categories (e.g., personality). Percentage shows how many persons/bots were in a specific tribal macro-category; for instance, in the personality tribal macro-category, 100% of humans were categorized as journalists and, in consequence, 0% of them were categorized as politicians.




9. Recommendations for business

Whether we want it or not, bots and chatbots are already becoming actors in the online world. Working on new methods of creating trustworthy conversational agents for the future is a task for science, research, and business, with business ultimately implementing the methods. Taking into account the emerging trend of cobotization (Meda, 2017), finding paths toward creating safe, useful, and robust chatbot solutions is a very important task. Therefore, our recommendation is a focus on the three new trust-enhancing features of chatbots that we enumerated in the previous section (i.e., transparency, integrity, explainability) that allow for greater control. What remains to be seen is whether humanoid bots and robots, often employed in sales or customer service departments, are truly able to interact with humans, and how much real interaction and cooperation between a human being and a chatbot or robot is possible.

Appendix

User message: What did you do today?
Chatbot message: I think I'm working around the clock.

User message: Tired?
Chatbot message: Please try to formulate your question differently.

User message: Are you tired?
Chatbot message: I am not tired, rather the opposite. I am full of energy.

User message: That's good
Chatbot message: Great!

User message: Great
Chatbot message: I'm glad, too.

User message: What are your plans for the evening?
Chatbot message: I will sit here. That's how I spend most of the evenings.

This excerpt is from user no. 3. Both the user and this chatbot instance were assigned by the Condor system to the journalist personality type (using truthful language).

User message: Hi :)
Chatbot message: Hey!

User message: I am Oskar
Chatbot message: I am very pleased about this!

User message: How are you?
Chatbot message: Great, a lot of clients.

User message: Nice weather.
Chatbot message: I haven't left since yesterday, don't ask me about the weather.

User message: Ok, I have few questions.
Chatbot message: Every day I have a different color - 365 types for the whole year.

User message: Aha, great.
Chatbot message: I'm glad.

User message: I have few questions.
Chatbot message: I'm waiting for your questions.

This excerpt is from user no. 16. The user was assigned by the Condor system to the journalist personality type, while this chatbot instance was assigned to the politician type (using deceitful language).

References

Aamoth, D. (2014). Interview with Eugene Goostman, the fake kid who passed the Turing test. Time. Available at: https://time.com/2847900/eugene-goostman-turing-test/

Aron, J. (2015). Forget the Turing test - there are better ways of judging AI. New Scientist. Available at: https://institutions.newscientist.com/article/dn28206-forget-the-turing-test-there-are-better-ways-of-judging-ai/

Asher, N. (2017). A warmer welcome: Application of a chatbot as a facilitator for new hires onboarding. Available at: http://www.diva-portal.org/smash/record.jsf?pid=diva2%3A1116842&dswid=4708

Baron, J. (1985). Rationality and intelligence. Cambridge, UK: Cambridge University Press.

Bickmore, T., & Cassell, J. (2001). Relational agents: A model and implementation of building user trust. In Proceedings of the SIGCHI conference on human factors in computing systems (pp. 396-403). New York, NY: SIGCHI.

Botanalytics. (2018). How chatbot performance metrics differ by industry. Available at: https://chatbotslife.com/how-chatbot-performance-metrics-differ-by-industry-b380d4bd7f6b

Cassell, J., & Bickmore, T. (2000). External manifestations of trustworthiness in the interface. Communications of the ACM, 43(12), 50-56.

Castro, F., Tedesco, P., Alves, H., Quintino, J. P., Steffen, J., Oliveira, F., et al. (2018). Developing a corporate chatbot for a customer engagement program: A roadmap. In Intelligent computing theories and application (pp. 400-412). Cham, Switzerland: Springer International.

Chung, H., Iorga, M., Voas, J., & Lee, S. (2017a). Alexa, can I trust you? Computer, 50(9), 100-104.




Chung, M., Joung, H., & Ko, E. (2017b). The role of luxury brand's conversational agents: Comparison between face-to-face and chatbot. In Proceedings of the 2017 Global Fashion Management Conference at Vienna. Available at: http://gfmcproceedings.net/html/sub3_01.html?code=326161

Ciechanowski, L., Przegalinska, A., Magnuski, M., & Gloor, P. (2019). The shades of the uncanny valley: An experimental study of human-chatbot interaction. FGCS: Future Generation Computer Systems, 92, 539-548.

Ciechanowski, L., Przegalinska, A., & Wegner, K. (2018). The necessity of new paradigms in measuring human-chatbot interaction. In M. Hoffman (Ed.), Advances in cross-cultural decision making. Cham, Switzerland: Springer International.

Cova, B., & Cova, V. (2002). Tribal marketing: The tribalisation of society and its impact on the conduct of marketing. European Journal of Marketing, 36(5/6), 595-620.

Curry, C., & O'Shea, J. D. (2012). The implementation of a story-telling chatbot. Advances in Smart Systems Research, 1(1), 45-52.

De Oliveira, J. M., & Gloor, P. A. (2018). GalaxyScope: Finding the "truth of tribes" on social media. In F. Grippa, J. Leitao, J. Gluesing, K. Riopelle, & P. Gloor (Eds.), Collaborative innovation networks: Building adaptive and resilient organizations (pp. 153-164). Cham, Switzerland: Springer International.

Dijkstra, J. J., Liebrand, W. B. G., & Timminga, E. (1998). Persuasiveness of expert systems. Behaviour & Information Technology, 17(3), 155-163.

Dunn, J. (2016). We put Siri, Alexa, Google Assistant, and Cortana through a marathon of tests to see who's winning the virtual assistant race - here's what we found. Business Insider. Available at: https://www.businessinsider.com/siri-vs-google-assistant-cortana-alexa-2016-11

Dunn, J. R., & Schweitzer, M. E. (2005). Feeling and believing: The influence of emotion on trust. Journal of Personality and Social Psychology, 88(5), 736-748.

Fiske, S. T., Cuddy, A. J. C., Glick, P., & Xu, J. (2002). A model of (often mixed) stereotype content: Competence and warmth respectively follow from perceived status and competition. Journal of Personality and Social Psychology, 82(6), 878-902.

Fong, T., Nourbakhsh, I., & Dautenhahn, K. (2002). A survey of socially interactive robots: Concepts, design, and applications (CMU-RI-TR02-29). Pittsburgh, PA: Carnegie Mellon University.

French, R. M. (2012). Moving beyond the Turing test. Communications of the ACM, 55(12), 74-77.

Gartner. (2018). 5 trends emerge in the Gartner Hype Cycle for emerging technologies, 2018. Available at: https://www.gartner.com/smarterwithgartner/5-trends-emerge-in-gartner-hype-cycle-for-emerging-technologies-2018/

Gloor, P. A., Colladon, A. F., de Oliveira, J. M., & Rovelli, P. (2018). Identifying tribes on Twitter through shared context. In Y. Song, F. Grippa, P. A. Gloor, & J. Leitao (Eds.), Collaborative innovation networks. Cham, Switzerland: Springer.

Grunig, J. E. (2002). Qualitative methods for assessing relationships: Between organizations and publics. Available at: https://instituteforpr.org/wp-content/uploads/2002_AssessingRelations.pdf

Hachman, M. (2014). Battle of the digital assistants: Windows Phone Cortana vs. Google Now vs. Siri. PC World. Available at: https://www.pcworld.com/article/2142022/the-battle-of-the-digital-assistants-windows-phone-cortana-vs-google-now-vs-siri.html

Harwell, D. (2018). A Google program can pass as a human on the phone. Should it be required to tell people it's a machine? The Washington Post. Available at: https://www.washingtonpost.com/news/the-switch/wp/2018/05/08/a-google-program-can-pass-as-a-human-on-the-phone-should-it-be-required-to-tell-people-its-a-machine/?utm_term=.cf13615d7898

Hayes, A. (2017). Amazon Alexa: A quick-start beginner's guide. Scotts Valley, CA: CreateSpace Independent Publishing Platform.

Hernandez-Orallo, J. (2000). Beyond the Turing test. Journal of Logic, Language and Information, 9(4), 447-466.

HOMS Solicitors. (2019). Trusts: Tax implications. Available at: https://www.homs.ie/publications/trusts-tax-implications/

Hurley, R. F. (2011). The decision to trust: How leaders create high-trust organizations. Hoboken, NJ: John Wiley & Sons.

Institute for Public Relations. (2003). Guidelines for measuring trust in organizations. Available at: https://instituteforpr.org/guidelines-for-measuring-trust-in-organizations-2/

Kacprzyk, J., & Zadrozny, S. (2010). Computing with words is an implementable paradigm: Fuzzy queries, linguistic data summaries, and natural-language generation. IEEE Transactions on Fuzzy Systems, 18(3), 461-472.

Kaplan, A., & Haenlein, M. (2019). Siri, Siri, in my hand: Who's the fairest in the land? On the interpretations, illustrations, and implications of artificial intelligence. Business Horizons, 62(1), 15-25.

Khan, R., & Das, A. (2018). Introduction to chatbots. In R. Khan & A. Das (Eds.), Build better chatbots: A complete guide to getting started with chatbots (pp. 1-11). Berkeley, CA: Apress.

Kluwer, T. (2011). From chatbots to dialog systems. In D. Perez-Marin & I. Pascual-Nieto (Eds.), Conversational agents and natural language interaction: Techniques and effective practices (pp. 1-22). Hershey, PA: IGI Global.

Leviathan, Y., & Matias, Y. (2018). Google Duplex: An AI system for accomplishing real-world tasks over the phone. Google AI Blog. Available at: https://ai.googleblog.com/2018/05/duplex-ai-system-for-natural-conversation.html

Lopez, G., Quesada, L., & Guerrero, L. A. (2017). Alexa vs. Siri vs. Cortana vs. Google Assistant: A comparison of speech-based natural user interfaces. In I. L. Nunes (Ed.), Advances in human factors and systems interaction (pp. 241-250). Cham, Switzerland: Springer.

Lotze, N. (2016). Chatbots. Peter Lang. Available at: https://www.peterlang.com/view/title/18967

Meda, D. (2017). The future of work: The meaning and value of work in Europe. HAL Archives. Available at: https://hal.archives-ouvertes.fr/hal-01616579/

Moor, J. H. (2003). The status and future of the Turing test. Minds and Machines, 11(1), 77-93.

Mori, M., MacDorman, K. F., & Kageki, N. (2012). The uncanny valley. IEEE Robotics and Automation Magazine, 19(2), 98-100.

Morrissey, K., & Kirakowski, J. (2013). 'Realness' in chatbots: Establishing quantifiable criteria. In M. Kurosu (Ed.), Human-computer interaction: Interaction modalities and techniques (pp. 87-96). Berlin, Germany: Springer.

Moutinho, L., Dionísio, P., & Leal, C. (2007). Surf tribal behaviour: A sports marketing application. Marketing Intelligence & Planning, 25(7), 668-690.

Nordheim, C. B. (2018). Trust in chatbots for customer service: Findings from a questionnaire study. Available at: https://www.duo.uio.no/bitstream/handle/10852/63498/1/CecilieBertinussenNordheim_masteroppgaveV18.pdf

Okuda, T., & Shoda, S. (2018). AI-based chatbot service for financial industry. Fujitsu Scientific and Technical Journal, 54(2), 4-8.



Petkovic, D., Kobzik, L., & Re, C. (2018). Machine learning and deep analytics for biocomputing: Call for better explainability. Pacific Symposium on Biocomputing, 23, 623-627.

Phrasee. (2016). PARRY: The AI chatbot from 1972. Available at: https://phrasee.co/parry-the-a-i-chatterbot-from-1972/

Pochwatko, G., Giger, J.-C., Różańska-Walczuk, M., Świdrak, J., Kukiełka, K., Możaryn, J., et al. (2015). Polish version of the Negative Attitude toward Robots Scale (NARS-PL). Journal of Automation, Mobile Robotics and Intelligent Systems, 9(3), 65-72.

Preece, A. (2018). Asking 'why' in AI: Explainability of intelligent systems - perspectives and challenges. Intelligent Systems in Accounting, Finance and Management, 25(2), 63-72.

Premack, D., & Woodruff, G. (1978). Does the chimpanzee have a theory of mind? Behavioral and Brain Sciences, 1(4), 515-526.

Przegalinska, A. (2018). Wearable technologies in organizations: Privacy, efficiency and autonomy in work. New York, NY: Springer.

Rabinowitz, N. C., Perbet, F., Song, H. F., Zhang, C., Ali Eslami, S. M., & Botvinick, M. (2018). Machine theory of mind. Cornell University. Available at: https://arxiv.org/abs/1802.07740.pdf

Radziwill, N. M., & Benton, M. C. (2017). Evaluating quality of chatbots and intelligent conversational agents. Cornell University. Available at: http://arxiv.org/abs/1704.04579.pdf

Seeger, A.-M., & Heinzl, A. (2018). Human versus machine: Contingency factors of anthropomorphism as a trust-inducing design strategy for conversational agents. In F. Davis, R. Riedl, J. vom Brocke, P.-M. Leger, & A. Randolph (Eds.), Information systems and neuroscience (pp. 129-139). Cham, Switzerland: Springer International.

Shah, H., Warwick, K., Vallverdu, J., & Wu, D. (2016). Can machines talk? Comparison of Eliza with modern dialogue systems. Computers in Human Behavior, 58, 278-295.

Shawar, B. A., & Atwell, E. S. (2005). Using corpora in machine-learning chatbot systems. International Journal of Corpus Linguistics, 10(4), 489-516.

Shawar, B. A., & Atwell, E. (2007a). Chatbots: Are they really useful? LDV Forum, 22(1), 29-49.

Shawar, B. A., & Atwell, E. (2007b). Different measurement metrics to evaluate a chatbot system. In Proceedings of the workshop on bridging the gap: Academic and industrial research in dialog technologies (pp. 89-96). Stroudsburg, PA: Association for Computational Linguistics.

Shockley-Zalabak, P. S., Morreale, S., & Hackman, M. (2010). Building the high-trust organization: Strategies for supporting five key dimensions of trust. Hoboken, NJ: John Wiley & Sons.

Sloman, A. (2014). Judging chatbots at Turing test 2014. Available at: https://www.cs.bham.ac.uk/research/projects/cogaff/misc/turing-test-2014.pdf

Stojnić, A. (2015). Digital anthropomorphism: Performers, avatars, and chat-bots. Performance Research, 20(2), 70-77.

Taddeo, M. (2010). Trust in technology: A distinctive and a problematic relation. Knowledge, Technology & Policy, 23(3), 283-286.

de Visser, E. J., Krueger, F., McKnight, P., Scheid, S., Smith, M., Chalk, S., et al. (2012). The world is not enough: Trust in cognitive agents. In Proceedings of the annual meeting of the Human Factors and Ergonomics Society (pp. 263-267). Santa Monica, CA: HFES.

Waytz, A., Heafner, J., & Epley, N. (2014). The mind in the machine: Anthropomorphism increases trust in an autonomous vehicle. Journal of Experimental Social Psychology, 52, 113-117.

Weizenbaum, J. (1966). ELIZA - a computer program for the study of natural language communication between man and machine. Communications of the ACM, 9(1), 36-45.

Weizenbaum, J., & McCarthy, J. (1976). Computer power and human reason: From judgment to calculation. New York, NY: W.H. Freeman and Company.

Wimmer, H., & Perner, J. (1983). Beliefs about beliefs: Representation and constraining function of wrong beliefs in young children's understanding of deception. Cognition, 13(1), 103-128.

Zadrozny, W., Budzikowska, M., Chai, J., Kambhatla, N., Levesque, S., & Nicolov, N. (2000). Natural language dialogue for personalized interaction. Communications of the ACM, 43(8), 116-120.