Raghava Mukkamala and Ravi Vatrapu - Copenhagen...

Raghava Mukkamala and Ravi Vatrapu

Centre for Business Data Analytics (bda.cbs.dk)
Department of IT Management
Copenhagen Business School

Phone: +45-4185-2299
Email: rrm.itm@cbs.dk

Web: http://www.cbs.dk/en/staff/rrmitm

Pre-ICIS Workshop on Text Mining as a Strategy of Inquiry in Information Systems Research
Sunday, 11-December-2016, Dublin, Ireland

Motivation

• Automated text analysis has gained prominence because it can substantially reduce the cost of analyzing large volumes of text

• New Big Social Data Analytics techniques increasingly use text analysis as part of their mainstream analysis
  • due to the ubiquitous use of social media platforms by millions of users
  • huge amounts of content (including text) accumulate continuously

• To find out users' opinions, emotions, etc. from text documents

Big Social Data Analytics – Overall Methodology

[IEEE BigData Congress 2014], [IEEE Access 2016]

Text Classification

• Classification: assigning text documents* to predefined categories
• Category: a set of labels modeling a domain-specific concept
  • E.g. Sentiment or Emotion¹ analysis
  • Social Influence²: Reciprocity, Commitment and Consistency, Social Proof, Authority, Liking, Scarcity

• Classifying documents into known categories:
  • Dictionary methods (or lexicon-based models)
  • Supervised learning methods

• Discovering (unknown) categories and topics
  • Topic modeling

1) Ekman, Paul. "An argument for basic emotions." Cognition & Emotion 6.3-4 (1992): 169-200.
2) Cialdini, Robert B. Influence. Vol. 3. A. Michel, 1987.

* Text document: a unit of text, which could be one or more words, sentences, or paragraphs

Dictionary Methods

• Dictionary methods:
  • Use the rate/proportion at which keywords appear in documents
  • Use a list of words with scores to determine the document's category label
    • Boring: -1, Disgust: -3, inspire: +2, masterpiece: +5
  • Limited to categories for which dictionaries are available (Sentiment, Emotion, etc.)
  • Domain-specific, i.e. accuracy depends on the domain from which the words are taken
    • Usage of the word {crude} in “crude oil” vs. “a crude joke”
  • Validating dictionaries is hard (when we use them, we don't know how much accuracy we will get)
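
The scoring idea on this slide can be sketched in a few lines of Python. This is only a minimal illustration, not the MUTATO implementation: the four word scores come from the slide, while the function names, the tokenizer, and the sign-based thresholds are assumptions made for the example.

```python
import re

# Word scores taken from the slide; a real lexicon would be far larger.
LEXICON = {"boring": -1, "disgust": -3, "inspire": 2, "masterpiece": 5}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def dictionary_score(text, lexicon=LEXICON):
    """Return the summed word score and the share of tokens found in the lexicon."""
    tokens = tokenize(text)
    hits = [lexicon[t] for t in tokens if t in lexicon]
    rate = len(hits) / len(tokens) if tokens else 0.0
    return sum(hits), rate


def label(text, lexicon=LEXICON):
    """Map the summed score to a coarse label (the thresholds are an assumption)."""
    total, _ = dictionary_score(text, lexicon)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"


print(label("A boring start, but the ending is a masterpiece"))
print(label("Crude oil prices rose today"))
```

The second sentence contains no lexicon word at all, so it falls back to "neutral". A sentiment lexicon that did score "crude" negatively would mislabel it, which is the domain-dependence problem the slide describes.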

Supervised Learning Methods

• Supervised text classification
  • Uses training sets (documents with labels) manually coded by domain experts
  • Can be used with any domain-specific model or category
  • More accurate results than dictionary-based approaches
  • Validation is easy, as these methods provide performance measures
  • Drawback: preparing training sets can be expensive

• Applications: spam detection, age/gender identification, language identification, sentiment analysis, and so on
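
To make "performance measures" concrete, the sketch below computes accuracy, precision, recall, and F-measure for one target class from a list of gold labels and predicted labels. The function name and the spam/ham toy data are hypothetical; the slides do not show how MUTATO computes these values.

```python
def binary_metrics(gold, predicted, positive):
    """Accuracy, precision, recall and F1 for one class treated as 'positive'."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    correct = sum(1 for g, p in pairs if g == p)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical expert labels and classifier output for six documents.
gold = ["spam", "spam", "ham", "ham", "spam", "ham"]
predicted = ["spam", "ham", "ham", "spam", "spam", "ham"]
print(binary_metrics(gold, predicted, positive="spam"))
```

With these toy labels, four of six documents are classified correctly, and precision, recall, and F1 for "spam" all come out at two thirds.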

Ambiguity makes NLP hard

Real newspaper headlines
• Teacher Strikes Idle Kids
  • #1 The teacher is on strike, which idles the kids.
  • #2 A teacher strikes kids who are idle

• Ban on Nude Dancing on Governor's Desk
  • #1 Ban on [Nude Dancing on Governor's Desk]
  • #2 [Ban on Nude Dancing] on Governor's Desk

• If the text contains ambiguity, classification accuracy may vary

Dan Jurafsky and Christopher Manning. Natural Language Processing (Coursera - Stanford University) https://www.coursera.org/course/nlp

MUTATO: Frontend

MUTATO: Architecture

[Figure: architecture and workflow of MUTATO, the Multi-dimensional Text Classification Tool. Recoverable labels:]

• Roles: Domain Expert, Text Analyst
• Inputs: text corpus (social data, documents, etc.), search words, training set coded by domain experts (the figure shows "Global Perspective documents: 21")
• Steps: Text Extraction, Text Preprocessing, Word Frequency Analysis, Collocation Analysis, Keyword Analysis (factors with search words, keyword counts), Classifier Training (text corpus for training set, training set with labels), Classified Texts with Labels
• Components: Text Mining/Topic Modeling and Text Classification, built with Natural Language Toolkit (NLTK), Gensim, Python, ASP.Net
• Classification performance measures: Accuracy, Precision, Recall, F-Measure
• Inter-coder agreement: Inter-rater Agreement, Cohen's Kappa
• Results:
  • Keyword Analysis: keyword counts, most prominent categories
  • Word Frequency Analysis: most frequent words, frequency distributions
  • Collocation Analysis: bigrams, trigrams, and n-grams
  • Text Classification: multi-label domain-specific classified texts
  • Topic Modeling: discovering topics and categories

LDIC 2016 (Best Paper Nomination)

Text Mining (Unsupervised)

• Keyword analysis: keyword counts using the Natural Language Toolkit (NLTK)
• Word frequency analysis: frequently occurring words in a given text corpus, obtained from the term-document matrix (e.g. the top 100 most frequent words)
• Collocation analysis: collocations are multi-word expressions that commonly co-occur in the documents
  • provides insight into the documents through bigrams, trigrams, and n-grams of words that co-occur
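
The three analyses above can be illustrated with a small sketch. The slides state that the tool uses NLTK; to keep the example dependency-free, this version swaps in Python's standard library (collections.Counter) instead, and the two sample documents, the keyword list, and the helper names are made up for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


# Hypothetical two-document corpus.
docs = [
    "Crude oil prices rose as crude oil supply fell",
    "Analysts expect crude oil prices to stay high",
]
tokens_per_doc = [tokenize(d) for d in docs]

# Word frequency analysis: counts over the whole corpus.
word_freq = Counter(t for toks in tokens_per_doc for t in toks)

# Collocation analysis: n-grams are built per document so that no n-gram
# spans a document boundary.
bigram_freq = Counter(bg for toks in tokens_per_doc for bg in ngrams(toks, 2))
trigram_freq = Counter(tg for toks in tokens_per_doc for tg in ngrams(toks, 3))

# Keyword analysis: counts for a list of search words.
keywords = ["oil", "prices", "gas"]
keyword_counts = {k: word_freq[k] for k in keywords}

print(word_freq.most_common(3))
print(bigram_freq.most_common(2))
print(keyword_counts)
```

Raw counts are the simplest possible ranking. Association measures such as pointwise mutual information are a common refinement for collocations, but the slides do not say which measure the tool applies.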

Topic Modeling (Unsupervised)

• To identify/discover topics and information patterns in text
• Clustering techniques to group words based on similarity distances
• Tool based on the Gensim¹ library + Python

1) Gensim, topic modeling for humans: https://radimrehurek.com/gensim/

Text Classification

• Input:
  • a document d
  • a fixed set of classes C = {c1, c2, …, cJ}
  • a training set of m hand-labeled documents {(d1, c1), …, (dm, cm)}

• Output:
  • a learned (trained) classifier γ: d → c, where c ∈ C

• Classifiers using the Naïve Bayes algorithm
  • Alternatives: logistic regression, support-vector machines, neural networks

• Naïve Bayes classifier
  • Based on Bayes' rule of conditional probabilities
  • Bag-of-words approach
  • Requires training sets manually coded by domain experts
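
As a rough illustration of the setup above, here is a small multinomial Naïve Bayes classifier over bags of words, written with the standard library only. It is a sketch under stated assumptions, not the tool's code: the add-one (Laplace) smoothing, the choice to skip words unseen in training, the function names, and the four training documents are all choices made for this example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(training_set):
    """Count documents per class and word occurrences per class.

    training_set is a list of (document, class) pairs, i.e. {(d1, c1), ...}.
    """
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, cls in training_set:
        class_docs[cls] += 1
        for tok in tokenize(doc):
            word_counts[cls][tok] += 1
            vocab.add(tok)
    return class_docs, word_counts, vocab


def classify(doc, model):
    """Return the class c maximising log P(c) + sum of log P(word | c)."""
    class_docs, word_counts, vocab = model
    n_docs = sum(class_docs.values())
    best_cls, best_score = None, -math.inf
    for cls in class_docs:
        score = math.log(class_docs[cls] / n_docs)  # log prior
        total = sum(word_counts[cls].values())
        for tok in tokenize(doc):
            if tok not in vocab:
                continue  # words never seen in training are ignored
            # Add-one smoothing keeps unseen (word, class) pairs above zero.
            score += math.log((word_counts[cls][tok] + 1) / (total + len(vocab)))
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls


# Hypothetical hand-labeled training set.
training_set = [
    ("an inspiring masterpiece", "positive"),
    ("a boring and disgusting film", "negative"),
    ("truly inspiring story", "positive"),
    ("boring plot", "negative"),
]
model = train(training_set)
print(classify("an inspiring story", model))
print(classify("such a boring film", model))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, which matters once documents are longer than a few words.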

Training Sets – Manual Coding

• Systematic approach for manual content analysis suggested by Rebecca Morris [1]
• Reliability: Cohen's kappa value
  • po = 0.16 + 0.31 + 0.41 = 0.88
  • pc = (0.20 × 0.21) + (0.37 × 0.34) + (0.43 × 0.45) = 0.362

1. R. Morris, “Computerized content analysis in management research: A demonstration of advantages & limitations,” Journal of Management, vol. 20, no. 4, pp. 903–931, 1994.
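
The slide gives the observed agreement po and the chance agreement pc but not the resulting kappa. Using the standard definition κ = (po − pc) / (1 − pc), the value can be worked out as in the sketch below. The function name is hypothetical, and which list of marginal proportions belongs to which coder is an assumption, since the slide only shows the products.

```python
def cohens_kappa(p_o, marginals_a, marginals_b):
    """Return (kappa, p_c) from observed agreement and per-category marginals.

    p_c is the agreement expected by chance: the sum over categories of the
    product of the two coders' marginal proportions.
    """
    p_c = sum(a * b for a, b in zip(marginals_a, marginals_b))
    kappa = (p_o - p_c) / (1 - p_c)
    return kappa, p_c


# Figures as printed on the slide (three categories).
p_o = 0.16 + 0.31 + 0.41
marginals_a = [0.20, 0.37, 0.43]  # assumed: first factor of each product
marginals_b = [0.21, 0.34, 0.45]  # assumed: second factor of each product

kappa, p_c = cohens_kappa(p_o, marginals_a, marginals_b)
print(f"p_c = {p_c:.4f}, kappa = {kappa:.2f}")
```

With the rounded marginals shown, pc evaluates to about 0.361, marginally below the 0.362 on the slide, presumably because the slide's figure was computed from unrounded proportions. Either value gives a kappa of about 0.81.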

Model Training Tool

• https://textmining.cbs.dk/TextClassification/ClsssifyTextModels.aspx

Domain-Specific Classifier #01: Marketing

“Heres an idea. If you like their food eat there. If you dont like their food eat somewhere else or make your own meal. I really dont understand what the big deal is.”

[Figure labels: User Consumer, Organisation, Social Influence]

Domain-Specific Classifier #02: Operations Research

“Brazilian highway transport showcases a series of positive features such as flexibility, availability, and speed. However, when compared to other modes, it bears limitations such as low productivity, low energy efficiency, and low safety indices.”

Domain-Specific Classifier #03: Public Health

“What this post is saying: Some obese people don't suffer from Type 2 Diabetes.. What this post isn't saying: Obesity doesn't cause Type 2 Diabetes.. You can be healthy and obese.”

Classifier

• Using the Natural Language Toolkit and Python
• Custom Python script (~1000 lines) used for training and classification of the texts
• MUTATO 1.0 automates the whole process as a tool

Performance Measures

Tool Statistics

• Languages supported: English, Danish, Norwegian, Swedish, [Finnish]

• Classification done for
  • 20 BDA/BSDA student projects with a variety of datasets: H&M, Danske Bank, Volkswagen crisis, Skavlan talk show, TV2 Norway, etc.
  • ~10 Master's thesis projects: Patient-journey, Jabra-Classification, Skat data, SAS vs Norwegian Airlines, Transportation Logistics, Couchsurfing FB data

• Research articles: 12

Research Publications: Text Analytics
IEEE EDOC 2014, IEEE BigData 2016, IEEE BigData 2016

Research Publications: Social Media Crisis
IEEE BigData Congress 2015, IEEE EDOC 2015, IEEE Access Journal

Research Publications: Crowdfunding & Crowdsourcing
ACM Mindtrek 2016, HICSS 2016

Research Publications: Public Health
ICTH 2016, IEEE HealthCom 2016

Research Publications: Operations Research
LDIC 2016 (Best Paper Nomination), WCTR 2016

Future Research

• Text summarization techniques for asynchronous communication
• Danish, Norwegian, and Finnish languages
• Discourse analysis for asynchronous communication (such as blogs, social media)
• Based on hidden Markov models and graph optimization techniques
• Using intra-sentential rhetorical parse trees and aspect-based discourse trees

Thank You

rrm.itm@cbs.dk, rv.itm@cbs.dk
