

Expert Systems with Applications 37 (2010) 8705–8710

Contents lists available at ScienceDirect

Expert Systems with Applications

journal homepage: www.elsevier.com/locate/eswa

Chat mining: Automatically determination of chat conversations’ topic in Turkish text based chat mediums

Özcan Özyurt *, Cemal Köse

Karadeniz Technical University, Department of Computer Engineering, Faculty of Engineering, 61080 Trabzon, Turkey

Article info

Keywords: Chat mining; Topic detection; Chat conversations; Feature selection; Text classification

0957-4174/$ - see front matter © 2010 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2010.06.053

* Corresponding author. Tel.: +90 0462 8716922x8562; fax: +90 462 8717424. E-mail addresses: [email protected] (Ö. Özyurt), [email protected] (C. Köse).

Abstract

Conversations in chat mediums often carry important information about the speakers, such as their tendencies, habits, attitudes, guilt situations and intentions. Analysing and processing these conversations is therefore of great importance, and many social and semantic inferences can be drawn from them. Determining the subject of a conversation provides a basis for characterizing and analysing it.

In this study, chat mining is treated as an application of text mining, and the subject of Turkish text based chat conversations is determined. Supervised learning methods are used to classify the conversations, with Naive Bayes, k-Nearest Neighbor and Support Vector Machine as classifiers. A success rate of 91% is achieved in determining the subject.

© 2010 Elsevier Ltd. All rights reserved.

1. Introduction

With the development of the internet, the computer has become an important means of communication, and chat conversations are now widely used as a text based communication tool. Chat mediums are frequently used by people of all ages. As their use spreads, the importance of social and semantic inferences drawn from chat mediums increases day by day (Haichao, Siu, & Yulan, 2006; Khan, Fisher, Shuler, Tianhao, & Pottenger, 2002; Kose & Ozyurt, 2006; Kose, Ozyurt, & Ikibas, 2008). From this point of view, it becomes necessary to analyse these conversations and to understand the characteristics of the speakers. One of the most important factors in analysing chat conversations is the determination of the conversation topic (Haichao et al., 2006). The logs kept on the computer are an important data source for communication in chat mediums. By processing these files and applying data mining rules, basic characteristics of the speakers can be deduced (Bengel, Gauch, Mitter, & Vijayaraghavan, 2004; Bing, Xiaoli, Wee, & Philip, 2004; Haichao et al., 2006). Thus, useful information such as guilt analysis, tendencies of speakers and areas of interest can be obtained from conversations. With the aid of machine learning techniques, data mining and sound analysis of chat conversations, it will be possible in the near future to develop applications such as detecting terrorist attacks and performing guilt analysis (Khan, Fisher, Shuler, Tianhao, & Pottenger, 2002; Kolenda, Hansen, & Larsen, 2001; Elnahrawy, 2002; Haichao et al., 2006).

In text mining and chat mining applications, two kinds of methods, unsupervised and supervised learning, are applied for data classification (Amasyali & Diri, 2006; Han & Kamber, 2006; Han, Karypis, & Kumar, 2001; Koppel, Argamon, & Shimoni, 2002). In unsupervised learning, the data are examined, and items that match each other are gathered in one cluster while items that do not match are placed in other clusters. This operation is called clustering. Because there are no preset classes in clustering, this kind of learning is called unsupervised. Learning with preset classes is called supervised learning, and its counterpart of clustering is classification. One of the biggest problems of the supervised approach is determining the classes precisely and accurately from training sets. Once the classes are determined, supervised approaches are easier and more effective than unsupervised ones. Unsupervised approaches, on the other hand, can obtain all topics from the text, but they are more difficult and complicated than supervised approaches (Bingham, Kab, & Girolami, 2003; Han & Kamber, 2006; Joachims, 1998; Kolenda et al., 2001).

In text mining applications, determination of the conversation topic is one of the important areas of study. Most studies in this area address the classification of news texts. Others are related to determining the characteristics of a text's writer (Koppel et al., 2002; Amasyali & Diri, 2006). As chat conversations have become pervasive and chat mediums have accumulated data, various studies have been conducted in this area. Determination of the speakers' characteristics and genders and determination of the subject of the conversation form the basis of these studies, which can be gathered under the name of chat mining. They have used a range of classification techniques: regression models, k-Nearest Neighbors classification (k-NN), Decision Trees, Bayesian probabilistic approaches, neural networks and Support Vector Machines (SVM) are widely used in chat mining applications (Bengel et al., 2004; Elnahrawy, 2002; Haichao et al., 2006; Kose et al., 2008). Elnahrawy (2002) presented an offline topic categorization approach for analysing chat conversation logs related to criminal activities. There, logs are first pre-processed by removing stop-words and then converted into term frequency weighted vectors. Categorization techniques including k-NN, Naive Bayes and linear SVM are then employed for topic classification. Bengel et al. (2004) also adopted a categorization approach for analysing chat messages from Internet Relay Chat. In that study, archived chat conversations are filtered on the basis of time, channel and speaker. The resulting collections of chat conversations are grouped as “sessions” for processing and categorization. Each session is pre-processed with stop-word removal and stemming, and then represented using the TFIDF weighting scheme for classification. Haichao et al. (2006) used an indicative term-based approach to determine the subject discussed in chat conversations, classifying the subjects of conversations with supervised approaches. Kose et al. (2008) studied the determination of speakers' genders as an example of information inference from chat conversations. Chat conversations were analysed, and a comparative method was used to determine the speakers' genders.

Table 1. Information of conversations gathered from chat mediums.

| Statistical information | Numerical values |
|---|---|
| Total number of conversations | 154 |
| Total number of words used in the conversations | 24,993 |
| Total number of most frequently used 20 words | 4195 |
| Words with spelling error | 2454 |
| Proportion of number of words with spelling error to the total number of the words | 9.8% |
| The number of acronyms, short forms and icons | 2167 |
| Proportion of the number of most frequently used 20 words to the total number of the words | 16.8% |
| Proportion of the number of most frequently used 50 words to the total number of words | 31.4% |
| Proportion of the number of most frequently used 100 words to the total number of words | 44.3% |

In this study, chat conversations were analysed structurally, and an attempt was made to determine the characteristics of conversations via machine learning and data mining methods. As an example of information inference from chat conversations, the subject of the conversation was determined. The main purpose of the study is to answer the question “Which subject are the speakers talking about?” The primary originality of our study lies in obtaining social and logical inferences from Turkish text based chat conversations. For this purpose, we focused on determining the subject spoken about in chat conversations using a term-based approach. Supervised learning techniques were used: a specific set of conversations served as the training set, and the subjects were determined beforehand. The classes were fixed in advance so that commonly spoken topics could be determined from a limited number of conversations. For the test set, the task was to determine which of the preset subjects each analysed conversation belonged to. Naive Bayes, k-Nearest Neighbor and Support Vector Machine methods were used to classify the conversations. The rest of this paper is organized as follows. Collection and analysis of data are described in Section 2. Characteristics of chat conversations are discussed in Section 3. The phases of determining the subject of a chat conversation are presented in detail in Section 4. The implementation and results are discussed in Section 5. The conclusion and future work are given in Section 6.

2. Collection and analysis of data

For the evaluation of chat conversations, data were gathered from chat mediums using the log files of two widely used tools, MSN Messenger and mIRC (a shareware Internet Relay Chat client). The gathered data are text files in which conversations are stored. These files were pre-processed and prepared for data mining, following the basic steps of data mining. The total size of the conversations is 4.7 MB. The shortest conversation lasts 1 min, while the longest lasts 155 min. One hundred and fifty-four conversations were gathered from these mediums; 75 of them were used as the training set and the remaining 79 as the test set. The main features of the conversations are presented in Table 1.

Statistical information such as the total number of words and the number of words with spelling errors is also given in Table 1. In addition, a list of the most frequently used words in the conversations was compiled. The 20, 50 and 100 most frequently used words account for 16.8%, 31.4% and 44.3% of the total number of words, respectively.

According to the gathered data, then, nearly half of the 24,993 words used in the conversations are occurrences of the 100 most frequent words. This demonstrates that specific words are used very frequently in chat mediums. Another conspicuous feature of chat conversations is the proportion of words with spelling errors: as can be read from the table, they make up 9.8% of the total. Acronyms, short forms and icons were not counted as spelling errors in this figure.

When these are also taken into account, 19% of the words used in the conversations were written incorrectly or incompletely, that is, roughly one out of every five words. This shows that chat conversations differ considerably from news or other normally written texts.
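The coverage figures in this section can be reproduced with a few lines of Python. The sketch below is illustrative only: `top_n_coverage` is a hypothetical helper, the token list is a toy example rather than the collected logs, and the final lines simply redo the arithmetic on the counts reported in Table 1. The same arithmetic puts the combined share of mis-spelled words, acronyms, short forms and icons at about 18.5%, close to the 19% quoted in the text.

```python
from collections import Counter


def top_n_coverage(tokens, n):
    """Share of all tokens accounted for by the n most frequent word types."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(n))
    return covered / len(tokens)


# Toy token list (hypothetical, not taken from the collected logs).
tokens = "slm nbr iyi sen iyi tmm slm tmm tmm maç".split()
print(round(top_n_coverage(tokens, 2), 2))

# Cross-check of the ratios reported in Table 1.
TOTAL_WORDS = 24_993
top20_share = 100 * 4195 / TOTAL_WORDS             # 20 most frequent words
misspelled_share = 100 * 2454 / TOTAL_WORDS        # words with spelling errors
combined_share = 100 * (2454 + 2167) / TOTAL_WORDS  # plus acronyms, short forms, icons
print(round(top20_share, 1), round(misspelled_share, 1), round(combined_share, 1))
```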

3. Determining characteristics of chat conversations

When chat conversations are examined, their content is seen to differ from normally written material. This originates from the nature of chat conversations. Conversations in chat mediums contain many mis-spelled words, as well as many short forms, signs and words with special meanings. In addition, the frequency of the words used and the length of the sentences differ from those of normal texts (Haichao et al., 2006; Khan et al., 2002; Kose & Ozyurt, 2006; Kose et al., 2008). From this point of view, chat language is rather different from normal written language in terms of syntax. In this study, conversation texts obtained from chat mediums were first examined in order to determine the basic characteristics of chat conversations.

Table 2. Acronym and short form examples used in conversations.

| Acronyms | Meaning | Short forms | Meaning |
|---|---|---|---|
| KİB | Take care | tmm | OK |
| AEO | God be with you | tlf | Phone |
| SÇS | I love you | üniv | University |
| ARO | God bestow mercy upon you | inş | If God wills |
| SG | See you later | cvp | Response |

Table 3. Samples from signs used in conversations.

| Signs | Meaning |
|---|---|
| :), :)))))), :)), :-)) | Laughing |
| :D, :d | Laughing loudly |
| ?, ?-, ????, _?, ...!? | Question and asking other meanings? |
| :P, :PPP, :p | To show tongue |
| ;), ;)), ;))) | Winking |
| :(, :((, :((( | Unhappiness |
| :-), :=), :-))) | Laughing |

Table 4. Characteristics of conversations used in training set.

| Determination of conversation topic | Numbers | Percentage |
|---|---|---|
| Total number of conversations in training set | 75 | 100 |
| Number of conversations whose subject could not be determined | 7 | 9.3 |
| Number of conversations focusing on single topic | 25 | 33.3 |
| Number of conversations focusing on two or more topics | 43 | 57.4 |

3.1. Chat language

In the real-time and informal environment of IM (instant messaging) systems, chat messages are very different from conventional text (Haichao et al., 2006; Kose & Ozyurt, 2006; Kose et al., 2008). Chat language includes acronyms, short forms, polysemes, synonyms and mis-spelled terms. In addition, the texts contain mistakenly written words and irregular short forms that do not follow formal grammatical rules. The special expressions and spelling errors commonly encountered in chat conversations can be grouped as follows.

Acronyms are formed by taking the first letters of a sequence of words. For example, “KİB” is an acronym for “Kendine İyi Bak (Take care)”, “SÇS” is an acronym for “Seni Çok Seviyorum (I love you)” and “AEO” is an acronym for “Allah'a Emanet Ol (God be with you)”.

Short forms replace a lengthy word with a shorter alternative expression. For example, tmm is a short form for “tamam (okay)” and tşk is a short form for “teşekkür ederim (thank you)”. Unlike acronyms, only some popular short forms have fixed expressions shared among different chat participants; many depend heavily on the context of the conversation and on the chatters. Table 2 shows some example short forms and some of the most popular acronyms.

Icons are used in conversations, such as :), :)))))), :)) (laughing); ?, ?-, ????, _?, ...!? (question and asking other meanings?); :P, :PPP, :p (to show tongue); and :(, :((, :((( (unhappiness). Some of these icons have the same meaning although they are written differently. Some icons used in conversations are listed in Table 3.

Mis-spelling of terms is more frequent in chat conversations than in formal text documents, owing to the nature of chat. In some cases a chat participant mis-spells a word on purpose to emphasize its meaning. A common case is the use of duplicated letters, such as “evettttt”, “yawwwww” and “okkkk” instead of “evet (yes)”, “yahu (hey!)” and “okey (okay)”, respectively. The number of duplications is not fixed.
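The categories above suggest a simple normalization pass before any classification. The following Python sketch is one possible way to do it and is not the procedure used in the study: the lookup table `SHORT_FORMS`, the icon tags such as `<laugh>` and the rule that collapses three or more repeated letters are all assumptions made for illustration, seeded with the examples quoted in this section.

```python
import re

# Hypothetical lookup table seeded with short forms quoted in the text.
SHORT_FORMS = {
    "tmm": "tamam",
    "tşk": "teşekkür ederim",
    "meraba": "merhaba",
    "mrb": "merhaba",
    "mrh": "merhaba",
}

# Hypothetical icon tags; patterns are applied to lowercased text.
ICON_PATTERNS = [
    (re.compile(r":-?=?\)+"), "<laugh>"),   # :)  :)))  :-)  :=)
    (re.compile(r":-?\(+"), "<sad>"),       # :(  :(((
    (re.compile(r";-?\)+"), "<wink>"),      # ;)  ;)))
    (re.compile(r":d+"), "<laugh_loud>"),   # :D  :d
    (re.compile(r":p+"), "<tongue>"),       # :P  :PPP
]

# Three or more repetitions of the same character, e.g. "evettttt".
REPEATED = re.compile(r"(.)\1{2,}")


def normalize(message):
    """Return a list of normalized tokens for one chat posting."""
    # Naive lowercasing; Turkish I/İ would need locale-aware handling.
    text = message.lower()
    for pattern, tag in ICON_PATTERNS:
        text = pattern.sub(f" {tag} ", text)
    tokens = []
    for token in text.split():
        if token.startswith("<") and token.endswith(">"):
            tokens.append(token)  # icon tag, keep as is
            continue
        token = REPEATED.sub(r"\1", token).strip(".,!?")
        if token:
            # Expand known short forms; multi-word expansions are split.
            tokens.extend(SHORT_FORMS.get(token, token).split())
    return tokens


print(normalize("Evettttt tmm :)))"))
```

Collapsing only runs of three or more letters keeps legitimate double letters intact, at the cost of mapping “yawwwww” to “yaw” rather than to “yahu”; a fuller dictionary would be needed for that step.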

4. Determination of conversations’ topic

By the nature of chat conversations, the subject being talked about may change quickly within any chat session (Khan et al., 2002; Tianhao, Khan, Fisher, Shuler, & Pottenger, 2002; Elnahrawy, 2002; Haichao et al., 2006). While talking about one matter, the speakers may switch to another subject and then return to the previous one, or they may move on to a different matter altogether within the same conversation. Hence, one or more subjects may be discussed in any session. On the other hand, no subject can be deduced from some conversations; these are typically short conversations that are not about a specific matter. For these reasons, determining the conversation topic poses considerable difficulties. Table 4 shows statistics on the number of topics discussed in the conversations of the training set.

According to the data gathered from the conversations in the training set, the topic of 9.3% of the conversations could not be determined. A further 33.3% focused on a single topic, while 57.4% focused on two or more topics. The conversations whose topics could not be determined are usually very short, consisting of 5–6 words. A large majority of the conversations focusing on a single topic consist of 15–25 words. Most of the long conversations focus on two or more topics.

4.1. Identifying basic patterns of conversation threads

In determining the subject of a chat conversation, the thread and the ending of the conversation are important (Khan et al., 2002). By nature, a chat conversation changes and takes shape continuously. Determining threads and endings before determining the conversation topic can therefore help in identifying the topic or topics, even though this step is simple. The beginning of a topic can be regarded as a thread (Khan et al., 2002). This process is easier for conversations with a single topic; when there is more than one topic, it becomes harder to fix the thread and the ending. In this study, the threads and endings of single-topic conversations in particular were determined accurately. For conversations with more than one topic, about 85% success was achieved in determining the thread and ending of each subject. Determining the threads and endings of subjects in conversations with more than one topic, and threading such conversations, requires a separate, detailed study.

Before the conversation topic was determined, the patterns used for fixing threads and endings were identified. At the start of any topic, the direct or indirect expressions indicating that the conversation or subject has started should be determined (Khan et al., 2002). According to the data gathered from chat conversations, conversations generally start with calling, greeting and asking names and addresses. A conversation is understood to start when expressions like “slm (hi, hello)” or “nbr, nasılsın (how are you)” are used. Questions or ordinary sentences may accompany these words, as in the postings “Slm Ahmet, dün akşamki maçı izledin mi? (Hi Ahmet, did you watch the football match last night?)”, “Ali, nbr, Nasılsın? (Hello, how are you, Ali?)” and “Tatilde nereye gideceksin? (Where will you go on holiday?)”. Our approach relies on the fact that the use of these patterns indicates that a new thread has started. Such patterns can be regarded as direct patterns, and they can also be called “starting patterns”. Mis-spellings were taken into account for greeting and introduction words: words with the same meaning were treated as the same word even when they were written differently because of spelling errors. For example, “meraba”, “mrb”, “mrh” and “merhaba” were all taken as “merhaba (hello)”, so the use of any of them was evaluated as having the same meaning.

In any conversation text, it is important to know whether the same topic continues, as well as to determine the conversation topic. This makes it possible to determine whether one or more topics are present. Given the difficulty of determining the thread and ending of a subject, direct patterns alone are not sufficient for this process. Indirect patterns that help determine topic continuation should therefore also be identified and used. One example of an indirect pattern is the use of anaphoric expressions such as “O (he/she)” and “Bu/Şu (this/that)”. Such patterns usually continue the previous topic, though they may occasionally start a new thread. Another kind of indirect pattern is the length of expressions. Short expressions generally do not indicate the start of a new thread, whereas long expressions are generally used when a new topic starts. Short responses such as “evet (yeah, yes)”, “hayır (no)” and “katılıyorum (I agree)” generally occur in the continuation of a thread; they can also occur at its ending, but not at its start. These patterns can be called “continuing patterns”. To determine the ending of a thread, conversation finishing patterns, in other words “stopping patterns”, should be identified. From the data gathered from the conversations, patterns such as “görüşmek üzere (bye)”, “anlaşıldı (got it)” and “tamam (ok)”, which are used to finish conversations, were extracted. When such stopping patterns are seen, the thread is understood to be complete. In addition, the short responses used in the continuation of conversations are often used to close threads. Therefore, when thread-starting expressions follow short responses, it is decided that a new topic has started. If no further expression follows the short responses, this is assessed as the end of the thread, since a conversation can often end with short responses like “ok (okay)”, “peki (all right)”, “tamam (ok)” and “anlaşıldı (got it)”.

In evaluating the conversations, these patterns were taken into consideration, and an attempt was made to determine the thread and ending of each topic. The threads identified in the conversations were then used in determining the conversation topic. Determining the number of topics mentioned in a conversation before determining the topic or topics themselves helped us in solving this problem.
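The starting, continuing and stopping patterns described in this section can be sketched as a small rule-based segmenter. The Python code below is a simplified reading of those rules, not the implementation used in the study: the pattern lists contain only the examples quoted above, and the `LONG_MESSAGE` threshold, the label names and the functions `label` and `split_threads` are assumptions introduced for illustration.

```python
import re

# Hypothetical pattern lists seeded with the examples quoted in Section 4.1.
START_WORDS = {"slm", "merhaba", "meraba", "mrb", "mrh", "nbr", "nasılsın"}
STOP_PHRASES = {"görüşmek üzere", "anlaşıldı", "tamam", "ok", "peki"}
SHORT_RESPONSES = {"evet", "hayır", "katılıyorum"}
ANAPHORA = {"o", "bu", "şu"}
LONG_MESSAGE = 3  # assumed minimum word count for a "long" expression


def label(message):
    """Classify one posting as start, stop, short, continue or long."""
    # Naive lowercasing; Turkish I/İ would need locale-aware handling.
    words = re.findall(r"\w+", message.lower())
    phrase = " ".join(words)
    if any(word in START_WORDS for word in words):
        return "start"
    if phrase in STOP_PHRASES:
        return "stop"
    if phrase in SHORT_RESPONSES:
        return "short"
    if words and words[0] in ANAPHORA:
        return "continue"
    return "long" if len(words) >= LONG_MESSAGE else "continue"


def split_threads(messages):
    """Group consecutive postings into threads using the labels above."""
    threads, current, previous = [], [], None
    for message in messages:
        kind = label(message)
        # A starting pattern always opens a thread; a long posting opens one
        # only when it follows a stopping pattern or a short response.
        opens = kind == "start" or (kind == "long" and previous in {"stop", "short"})
        if opens and current:
            threads.append(current)
            current = []
        current.append(message)
        previous = kind
    if current:
        threads.append(current)
    return threads


chat = [
    "Slm Ahmet, dün akşamki maçı izledin mi?",
    "evet",
    "O çok iyi oynadı",
    "tamam",
    "Tatilde nereye gideceksin?",
    "Antalya",
]
for thread in split_threads(chat):
    print(thread)
```

On this toy exchange the greeting opens the first thread, the anaphoric “O ...” posting continues it, “tamam” closes it, and the long question that follows opens a second thread.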

Table 5A sample indicative Features set for ‘‘sport and slang/swearword topic

Order Term 1 Term 2

1 Spor (sports) Futbol (foot2 Oyuncu (player) Futbolcu (fo3 Forvet (striker) orta saha (m4 Faul (foul) Penaltı (pen. . .

1 Lan (man) Ulan (buddy2 Yahu (for god’s sake!) Yaw (why!)3 a.k (fuck) a.q (fuck)4 Manyak (maniac) Salak (fool). . .

4.2. Feature selection and topic detection

In this paper, a study concerning determination of conversationtopic of chat conversation was conducted. In determining conver-sation topic, preset topics were used. In other words, topics whichwere commonly spoken or mentioned were determined by analys-ing conversations used in training set. Taking these facts into con-sideration, we dwelled on the determination of what could be topicor topics of any conversation. While determining topics, most com-monly talked topics in chat mediums were paid attention, and itwas tried to determine which of these topics or topic were in-cluded in that conversation. In accordance with data gathered fromchat conversations, topics mentioned in conversations were classi-fied into five divisions. These were determined as ‘‘sports, love/marriage, education, slang/swearing, entertainment”. In order todetermine topics apart from these a sixth division was constitutedunder the name of ‘‘others”. Navie Bayes, k-Nearest Neighbor andSupport Vector Machine methods were used as classifiers. Classi-fier applications were implemented through a software namedWeka which is open source encoded and on internet (Witten &Frank, 2000).

In order to determine topic or topics mentioned in conversa-tions, feature clusters should be determined principally. In theselection of feature, indicative words and terms were determined,and they were gathered together under Indicative Feature Sets foreach topic. Components in this cluster can be a single name as wellas a phrase. An Indicative Feature Set example is given concerningsports in Table 5. While constituting an indicative cluster related toany topic, all key words which can be indicative for topic were ex-tracted. In addition to this, expressions which are close in meaningor directly related to each other were gathered in a single line.

The high dimensionality of text dataset affects negatively clas-sifier algorithms. Therefore, determining words which are relatedto constitution of Indicative Feature Set is of importance. Indicativecluster for each topic should not be too long or too short. If thiscluster is too long, it can contain irrelevant words, and processcomplexity can increase. On the other hand if it is too short, topicis not represented sufficiently, and performance may be affectednegatively (Haichao et al., 2006; Yang, 1999; Yang & Pederson,1997). Thus, training set for each topic was carefully evaluated,and was represented using TFIDF weight scheme for classification.

Table 5 (partially recovered; row beginnings are truncated in the source)
Sports: …ball), Maç (match), …otball player), Kaleci (goalkeeper), …idfield), Defans (defence), …alty), Ofsayt (offside)
Slang/swearword: …!), Laa (buddy!), Yaws (why!), a.g (fuck), Aptal (stupid)

Fig. 1. Feature selection and topic detection scheme.

In the topic determination process, the Indicative Feature Sets were constituted first, using the texts chosen as the training set. The conversation topic or topics were then determined using these Indicative Feature Sets and the features extracted from the training set. To do this, the characteristic vectors used in the training dataset were extracted. With the aid of the Indicative Feature Sets, the preset topic or topics with which the spoken subject overlapped were determined. In that way, the conversation topic was determined by means of classifiers. Both single-topic and multiple-topic determination were carried out.

The scheme for constituting the Indicative Feature Sets and determining topics by means of the test set is presented in Fig. 1.
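The idea of matching a conversation against the Indicative Feature Sets, with one or several topics as the outcome, can be sketched as follows. This is a simplified illustration and not the classifier-based procedure of the study: the function `detect_topics`, the `min_share` threshold and the two small feature sets are assumptions made for the example.

```python
def detect_topics(tokens, feature_sets, min_share=0.3):
    """Return the topics whose indicative terms account for at least
    `min_share` of all indicative-term hits in the conversation.

    `min_share` is a hypothetical threshold, not a value from the study.
    """
    # Count how many tokens of the conversation fall into each topic's set.
    hits = {topic: sum(1 for t in tokens if t in terms)
            for topic, terms in feature_sets.items()}
    total = sum(hits.values())
    if total == 0:
        return ["other"]  # no indicative term matched any topic
    return sorted(topic for topic, n in hits.items() if n / total >= min_share)

# Small illustrative feature sets built from terms of the kind listed in Table 5.
feature_sets = {
    "sport": {"maç", "kaleci", "defans", "ofsayt"},
    "slang": {"laa", "yaws", "aptal"},
}
single = detect_topics("maç kaleci ofsayt aptal".split(), feature_sets)
multiple = detect_topics("maç kaleci laa aptal".split(), feature_sets)
unmatched = detect_topics("merhaba nasılsın".split(), feature_sets)
```

With these inputs the first conversation is assigned only to sport, the second to both slang and sport, and the third falls back to the "other" class.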

5. Results

The aim of this study was to determine the conversation topic in chat conversations. Supervised learning techniques were used for this purpose.

Of the 154 conversations gathered from the internet, 75 were used as the training set and the remaining 79 as the test dataset. In preparing the training set, the Indicative Feature Sets composed for each topic were used. In the test process, the characteristic vector of each conversation was extracted, and the preset class to which it belonged was determined. For the classification, Weka, an open source classification software package, was used (Witten & Frank, 2000). The classification results are given in Table 6.
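The train and test procedure can be illustrated with a small k-nearest-neighbour sketch in plain Python. The study used Weka, so this is not the authors' implementation: cosine similarity, k = 3, the helper names and the toy feature vectors are all assumptions chosen for the example.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse vectors (dict: term -> weight)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_predict(query, training, k=3):
    """Majority label among the k training vectors most similar to `query`.

    `training` is a list of (vector, label) pairs.
    """
    ranked = sorted(training, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def accuracy(test, training, k=3):
    """Percentage of test conversations whose predicted label is correct."""
    correct = sum(1 for vec, label in test if knn_predict(vec, training, k) == label)
    return 100.0 * correct / len(test)

# Toy weighted vectors (made up; the study used 75 training and 79 test conversations).
training = [
    ({"maç": 1.0, "kaleci": 0.5}, "sport"),
    ({"ofsayt": 1.0, "maç": 0.4}, "sport"),
    ({"defans": 1.0}, "sport"),
    ({"aptal": 1.0}, "slang"),
    ({"laa": 0.8, "aptal": 0.3}, "slang"),
    ({"yaws": 1.0, "laa": 0.2}, "slang"),
]
test = [
    ({"maç": 0.7, "defans": 0.2}, "sport"),
    ({"laa": 0.6, "yaws": 0.1}, "slang"),
]
result = accuracy(test, training, k=3)
```

The accuracy figure is computed only on conversations that were not used for training, which mirrors the separation of the 75 training and 79 test conversations.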

When the results are examined, SVM gives the best result for every topic except Sport, where k-NN is slightly higher (88.6% against 87.1%). Slang/swearword is the topic with the highest accuracy: 91.7% (approximately 92%) was achieved with SVM. The main reason is that the terms in conversations on this topic are more indicative.

Table 6
Result of conversation topic determination.

Topic              Naïve Bayes (%)   k-NN (%)   SVM (%)
Sport              86.4              88.6       87.1
Love/marriage      84.5              85.1       86.3
Education          86.9              87.3       87.7
Slang/swearword    89.8              90.2       91.7
Entertainment      86.5              87.3       87.7
Other              87.4              88.4       89.6
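The figures in Table 6 can be summarised with a few lines of Python. The sketch below computes an unweighted mean accuracy per classifier and the best classifier per topic. The unweighted mean is an assumption, since the number of test conversations per topic is not reported, and the variable names are illustrative.

```python
# Accuracy values (%) copied from Table 6.
table6 = {
    "Sport":           {"Naive Bayes": 86.4, "k-NN": 88.6, "SVM": 87.1},
    "Love/marriage":   {"Naive Bayes": 84.5, "k-NN": 85.1, "SVM": 86.3},
    "Education":       {"Naive Bayes": 86.9, "k-NN": 87.3, "SVM": 87.7},
    "Slang/swearword": {"Naive Bayes": 89.8, "k-NN": 90.2, "SVM": 91.7},
    "Entertainment":   {"Naive Bayes": 86.5, "k-NN": 87.3, "SVM": 87.7},
    "Other":           {"Naive Bayes": 87.4, "k-NN": 88.4, "SVM": 89.6},
}
classifiers = ["Naive Bayes", "k-NN", "SVM"]

# Unweighted mean over the six topics for each classifier.
mean_accuracy = {
    c: sum(row[c] for row in table6.values()) / len(table6)
    for c in classifiers
}
# Classifier with the highest accuracy for each topic.
best_per_topic = {topic: max(row, key=row.get) for topic, row in table6.items()}

for c in classifiers:
    print(f"{c}: {mean_accuracy[c]:.2f}")
# prints:
# Naive Bayes: 86.92
# k-NN: 87.82
# SVM: 88.35
```

SVM has the highest mean and is the best classifier for five of the six topics; k-NN is best for Sport.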

6. Conclusion and future work

In this study, supervised classification was applied to determine automatically the topic or topics mentioned in conversations. Chat conversations gathered from the internet were examined and classified in order to determine their topics. The class to which each conversation belonged was determined from the data extracted from it. The classification reached an accuracy of up to about 92%.

To determine the topic or topics, the number of topics spoken about in a conversation was determined first. This process is a subject that should be studied independently. In this study, the threads and endings of topics were determined by basic-level analysis. In future studies, more advanced analysis of topic threads and endings can be carried out, so that they can be determined with a high rate of accuracy.

References

Amasyali, M. F., & Diri, B. (2006). Automatic Turkish text categorization in terms of author, genre and gender. In 11th international conference on applications of natural language to information systems, NLDB 2006 (pp. 221–226).

Bengel, J., Gauch, S., Mitter, E., & Vijayaraghavan, R. (2004). Chattrack: Chat room topic detection using classification. Lecture Notes in Computer Science, 3073, 266–277.

Bing, L., Xiaoli, L., Wee, S. L., & Philip, S. Y. (2004). Text classification by labeling words. In Nineteenth national conference on artificial intelligence (pp. 425–430).

Bingham, E., Kabán, A., & Girolami, M. (2003). Topic identification in dynamic text by complexity pursuit. Neural Processing Letters, 17, 69–83.

Elnahrawy, E. (2002). Log-based chat room monitoring using text categorization: A comparative study. In Proceedings of the IASTED international conference on information and knowledge sharing. St. Thomas: US Virgin Islands.

Haichao, D., Siu, C. H., & Yulan, H. (2006). Structural analysis of chat messages for topic detection. Online Information Review, 30(5), 496–516.

Han, E., Karypis, G., & Kumar, V. (2001). Text categorization using weight adjusted k-nearest neighbor classification. Lecture Notes in Computer Science, 2035, 53–65.

Han, J., & Kamber, M. (2006). Data mining: Concepts and techniques. New York: Morgan Kaufmann.


Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Lecture Notes in Computer Science, 1398, 137–142.

Khan, F. M., Fisher, T. A., Shuler, L. A., Tianhao, W., & Pottenger, W. M. (2002). Mining chat-room conversations for social and semantic interactions. Lehigh University Technical Report, LU-CSE-02-011.

Kolenda, T., Hansen, L. K., & Larsen, J. (2001). Signal detection using ICA: Application to chat room topic spotting. In Proceedings of the 3rd international conference on independent component analysis and signal separation (ICA2001) (pp. 540–545).

Koppel, M., Argamon, S., & Shimoni, A. R. (2002). Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 17(4), 401–412.

Kose, C., & Ozyurt, O. (2006). A target oriented agent to collect specific information in a chat medium. Lecture Notes in Computer Science, 4263, 697–706.

Kose, C., Ozyurt, O., & Ikibas, C. (2008). A comparison of textual data mining methods for sex identification in chat conversations. Lecture Notes in Computer Science, 4993, 638–643.

Tianhao, W., Khan, F. M., Fisher, T. A., Shuler, L. A., & Pottenger, W. M. (2002). Error-driven boolean-logic-rule-based learning for mining chat-room conversations. Lehigh University Technical Report, LU-CSE-02-008.

Yang, Y. (1999). An evaluation of statistical approaches to text categorization. Information Retrieval Journal, 1(2), 69–90.

Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In Proceedings of the fourteenth international conference on machine learning (pp. 412–420).

Witten, I. H., & Frank, E. (2000). Data mining: Practical machine learning tools and techniques with Java implementations. New York: Morgan Kaufmann (Chapter 8).