Raghava Mukkamala and Ravi Vatrapu
Centre for Business Data Analytics (bda.cbs.dk)
Department of IT Management, Copenhagen Business School
Phone: +45-4185-2299
Email: [email protected]
Web: http://www.cbs.dk/en/staff/rrmitm

Pre-ICIS Workshop on Text Mining as a Strategy of Inquiry in Information Systems Research
Sunday, 11-December-2016, Dublin, Ireland


Motivation

• Automated text analysis has gained prominence because it can substantially reduce the cost of analyzing large volumes of text
• New Big Social Data Analytics techniques increasingly use text analysis as part of their mainstream analysis
  • due to the ubiquitous use of social media platforms by millions of users
  • huge amounts of content (including text) are accumulated continuously
• Goal: to find out the users' opinions, emotions, etc. from text documents


Big Social Data Analytics: Overall Methodology

[IEEE Big Data Congress 2014], [IEEE Access 2016]


Text Classification
• Classification: assigning text documents* to predefined categories
• Category: a set of labels modeling a domain-specific concept
  • E.g. Sentiment or Emotion [1] analysis
  • Social Influence [2]: Reciprocity, Commitment and Consistency, Social Proof, Authority, Liking, Scarcity
• Classifying documents into known categories:
  • Dictionary methods (or lexicon-based models)
  • Supervised learning methods
• Discovering (unknown) categories and topics
  • Topic modeling

[1] Ekman, Paul. "An argument for basic emotions." Cognition & Emotion 6.3-4 (1992): 169-200.
[2] Cialdini, Robert B. Influence. Vol. 3. A. Michel, 1987.

* Text document: a unit of text, which could be one or more words/sentences/paragraphs


Dictionary Methods
• Dictionary methods:
  • Use the rate/proportion at which keywords appear in documents
  • Use a list of words with scores to find out the document category label
    • boring: -1, disgust: -3, inspire: +2, masterpiece: +5
  • Limited to categories for which dictionaries are available (Sentiment, Emotion, etc.)
  • Domain-specific, i.e. accuracy depends on the domain from which the words are taken
    • Usage of the word {crude} in "crude oil" vs "a crude joke"
  • Validation of dictionaries is hard (when we use them, we do not know in advance how much accuracy we will get)
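The scoring idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the tool's implementation: the four word scores come from the slide, while the function names, the tokenizer, and the thresholds that turn a score into a label are assumptions made for the example.

```python
import re

# Word scores taken from the slide; a real lexicon would be far larger.
LEXICON = {"boring": -1, "disgust": -3, "inspire": 2, "masterpiece": 5}


def score_document(text, lexicon=LEXICON):
    """Return (summed score, matched words) for one text document."""
    tokens = re.findall(r"[a-z']+", text.lower())
    matched = [t for t in tokens if t in lexicon]
    return sum(lexicon[t] for t in matched), matched


def label_document(text, lexicon=LEXICON):
    """Map the summed score to a coarse label (hypothetical thresholds)."""
    score, _ = score_document(text, lexicon)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    review = "A boring plot, but the ending is a masterpiece."
    print(score_document(review), label_document(review))
```

Because matching is by exact word form, "inspires" would not hit the entry "inspire", and a word such as "crude" would receive the same score in every domain, which is the limitation the slide points out.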


Supervised Learning Methods
• Supervised text classification
  • Uses training sets (documents with labels) manually coded by domain experts
  • Can be used with any domain-specific model or category
  • More accurate results than dictionary-based approaches
  • Validation is easy, as they provide performance measures
  • Drawback: preparing training sets might be an expensive task
• Applications: spam detection, age/gender identification, language identification, sentiment analysis and so on
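As an illustration of the performance measures that make validation straightforward, the sketch below computes accuracy, precision, recall and F1 for one target label from hand-coded (gold) and predicted labels. The function name and the spam/ham example data are hypothetical and chosen only for this example.

```python
def binary_metrics(gold, predicted, positive):
    """Accuracy, precision, recall and F1 with respect to one target label."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equally long")
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    correct = sum(1 for g, p in pairs if g == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    # Hypothetical labels: gold comes from domain experts, predicted from a classifier.
    gold = ["spam", "ham", "spam", "ham", "spam"]
    predicted = ["spam", "spam", "spam", "ham", "ham"]
    print(binary_metrics(gold, predicted, positive="spam"))
```

In practice these measures are computed on held-out documents that the classifier did not see during training.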


Ambiguity makes NLP hard

Real newspaper headlines
• Teacher Strikes Idle Kids
  • #1 The teacher is on strike, which idles the kids.
  • #2 A teacher strikes kids who are idle
• Ban on Nude Dancing on Governor's Desk
  • #1 Ban on [Nude Dancing on Governor's Desk]
  • #2 [Ban on Nude Dancing] on Governor's Desk
• If text contains ambiguity, the classification accuracy may vary

Dan Jurafsky and Christopher Manning. Natural Language Processing (Coursera - Stanford University) https://www.coursera.org/course/nlp


MUTATO: Frontend


MUTATO: Architecture

[Figure: architecture of MUTATO, the Multi-dimensional Text Classification Tool. Labels recoverable from the diagram:]
• Actors: domain expert, text analyst
• Inputs: text corpus (social data, documents, etc.), search words, and a training set coded by domain experts (example shown: Global Perspective documents: 21)
• Processing: text extraction, text preprocessing, word frequency analysis, collocation analysis, keyword analysis (factors with search words, keyword counts), classifier training (training data set with models)
• Technologies: Natural Language Toolkit (NLTK), Gensim, Python, ASP.Net
• Performance measures: classification performance (accuracy, precision, recall, F-measure) and inter-coder agreement (inter-rater agreement, Cohen's Kappa)
• Results: keyword analysis (keyword counts, most prominent categories); word frequency analysis (most frequent words, frequency distributions); text classification (multi-label, domain-specific classified texts); collocation analysis (bigrams, trigrams and n-grams); topic modeling (discovering topics and categories)

LDIC 2016 (Best Paper Nomination)


Text Mining (Unsupervised)
• Keyword Analysis: keyword counts using the Natural Language Toolkit (NLTK)
• Word Frequency Analysis: frequently occurring words from a given text corpus, using the term-document matrix (e.g. top 100 most frequent words)
• Collocation Analysis: collocations are expressions of multiple words which commonly co-occur in the documents
  • provides insights about documents through bigrams, trigrams and n-grams of words that co-occur in the documents
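The word frequency and collocation steps can be illustrated with plain counting. The slide names NLTK and a term-document matrix; the sketch below deliberately uses only the Python standard library, so it is an assumption-laden stand-in rather than the tool's code. The function names and the three example documents are hypothetical, and n-grams are ranked by raw counts rather than by an association measure.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(documents, top_n=100):
    """Most frequent words across all documents (e.g. the top 100)."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts.most_common(top_n)


def ngram_counts(documents, n=2):
    """Count n-grams (bigrams for n=2, trigrams for n=3) within each document."""
    counts = Counter()
    for doc in documents:
        tokens = tokenize(doc)
        # zip over n shifted copies of the token list yields consecutive n-tuples
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


if __name__ == "__main__":
    docs = ["Crude oil prices rise", "Crude oil exports fall", "A crude joke"]
    print(word_frequencies(docs, top_n=3))
    print(ngram_counts(docs, n=2).most_common(3))
```

Counting n-grams per document avoids creating spurious n-grams that span a document boundary.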


Topic Modeling (Unsupervised)
• To identify/discover topics and information patterns in text
• Clustering techniques to group the words based on similarity distances
• Tool based on the Gensim [1] library + Python

[1] Gensim, topic modeling for humans, https://radimrehurek.com/gensim/


Text Classification
• Input:
  • a document d
  • a fixed set of classes C = {c1, c2, …, cJ}
  • a training set of m hand-labeled documents {(d1, c1), ..., (dm, cm)}
• Output:
  • a learned or trained classifier γ: d → c, where c ∈ C
• Classifiers using the Naïve Bayes algorithm
  • Alternatives: logistic regression, support-vector machines, neural networks
• Naïve Bayes classifier
  • Based on Bayes' rule of conditional probabilities
  • Bag-of-words approach
  • Requires training sets manually coded by domain experts
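To make the input/output definition concrete, here is a minimal multinomial Naïve Bayes classifier over bag-of-words counts. It is a sketch under stated assumptions, not the MUTATO implementation: the class name, the tokenizer, the add-one (Laplace) smoothing, and the tiny training set are all introduced for illustration only.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Bag-of-words tokens: lowercase alphabetic strings."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesClassifier:
    """Multinomial Naive Bayes: argmax over c of log P(c) + sum of log P(w|c)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # add-alpha smoothing; an assumption of this sketch
        self.class_doc_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()

    def train(self, labeled_docs):
        """labeled_docs is the hand-labeled training set {(d1, c1), ..., (dm, cm)}."""
        for text, label in labeled_docs:
            self.class_doc_counts[label] += 1
            for token in tokenize(text):
                self.word_counts[label][token] += 1
                self.vocabulary.add(token)

    def log_scores(self, text):
        total_docs = sum(self.class_doc_counts.values())
        vocab_size = len(self.vocabulary)
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        scores = {}
        for label, n_docs in self.class_doc_counts.items():
            score = math.log(n_docs / total_docs)  # log prior P(c)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + self.alpha)
                                  / (total_words + self.alpha * vocab_size))
            scores[label] = score
        return scores

    def classify(self, text):
        """The learned classifier: maps a document d to a class c in C."""
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


if __name__ == "__main__":
    # Hypothetical hand-labeled training set.
    training_set = [
        ("a boring and dull film", "negative"),
        ("boring plot, weak acting", "negative"),
        ("an inspiring masterpiece", "positive"),
        ("brilliant and inspiring film", "positive"),
    ]
    clf = NaiveBayesClassifier()
    clf.train(training_set)
    print(clf.classify("a dull and boring plot"))
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied, and words never seen in training are skipped because they carry no evidence for any class.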


Training Sets: Manual Coding
• Systematic approach for manual content analysis suggested by Rebecca Morris [1]
• Reliability: Cohen's Kappa value
  • po (observed agreement) = 0.16 + 0.31 + 0.41 = 0.88
  • pc (chance agreement) = (0.20 × 0.21) + (0.37 × 0.34) + (0.43 × 0.45) ≈ 0.361
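The slide gives the two ingredients of Cohen's Kappa but not the final value. Using the standard definition κ = (po − pc) / (1 − pc), the numbers can be checked as follows. The function names are hypothetical, and treating the first factor of each product as one coder's category proportion and the second factor as the other coder's is an assumption, since the slide does not label them.

```python
def chance_agreement(marginals_a, marginals_b):
    """Expected agreement by chance: sum of products of per-category proportions."""
    if len(marginals_a) != len(marginals_b):
        raise ValueError("both coders need the same number of categories")
    return sum(a * b for a, b in zip(marginals_a, marginals_b))


def cohens_kappa(observed, chance):
    """Cohen's Kappa: agreement beyond chance, scaled by the maximum possible."""
    if chance >= 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - chance) / (1.0 - chance)


if __name__ == "__main__":
    # Values from the slide; the assignment of marginals to coders is assumed.
    p_o = 0.16 + 0.31 + 0.41
    p_c = chance_agreement([0.20, 0.37, 0.43], [0.21, 0.34, 0.45])
    print(round(p_o, 2), round(p_c, 4), round(cohens_kappa(p_o, p_c), 3))
```

With these inputs the chance agreement evaluates to about 0.361 and Kappa to about 0.81.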


[1] R. Morris, "Computerized content analysis in management research: A demonstration of advantages & limitations," Journal of Management, vol. 20, no. 4, pp. 903–931, 1994.


Model Training Tool

• https://textmining.cbs.dk/TextClassification/ClsssifyTextModels.aspx


Domain-Specific Classifier #01: Marketing

“Heres an idea. If you like their food eat there. If you dont like their food eat somewhere else or make your own meal. I really dont understand what the big deal is.”

[Figure labels: User Consumer, Organisation, Social Influence]


Domain-Specific Classifier #02: Operations Research

“Brazilian highway transport showcases a series of positive features such as flexibility, availability, and speed. However, when compared to other modes, it bears limitations such as low productivity, low energy efficiency, and low safety indices.”


Domain-Specific Classifier #03: Public Health

“What this post is saying: Some obese people don't suffer from Type 2 Diabetes.. What this post isn't saying: Obesity doesn't cause Type 2 Diabetes.. You can be healthy and obese.”


Classifier
• Using the Natural Language Toolkit and Python
• Custom Python script (~1000 lines) used for training and classification of the texts
• MUTATO 1.0 automates the whole process as a tool


Performance Measures


Tool Statistics
• Languages supported: English, Danish, Norwegian, Swedish, [Finnish]
• Classification done for:
  • 20 BDA/BSDA student projects with a variety of datasets: H&M, Danske Bank, Volkswagen crisis, Skavlan Talkshow, TV2 Norway, etc.
  • ~10 Master's thesis projects: Patient-journey, Jabra-Classification, Skat data, SAS vs Norwegian Airlines, Transportation Logistics, Couchsurfing FB data
• Research articles: 12


Research Publications: Text Analytics

IEEE EDOC 2014, IEEE Big Data 2016, IEEE Big Data 2016


Research Publications: Social Media Crisis

IEEE Big Data Congress 2015, IEEE EDOC 2015, IEEE Access Journal


Research Publications: Crowdfunding & Crowdsourcing

ACM Mindtrek 2016, HICSS 2016


Research Publications: Public Health

ICTH 2016, IEEE HealthCom 2016


Research Publications: Operations Research

LDIC 2016 (Best Paper Nomination), WCTR 2016


Future Research
• Text summarization techniques for asynchronous communication
  • Danish, Norwegian and Finnish languages
  • Discourse analysis for asynchronous communication (such as blogs, social media)
  • Based on Hidden Markov models and graph optimization techniques
  • Using intra-sentential rhetorical parse trees and aspect-based discourse trees

Thank You
[email protected], [email protected]