45
Author profiling: more linguistics and explanation Ben Verhoeven In close collaboration with Walter Daelemans Presented as a JOTA Lecture Ljubljana, Slovenia 22 November 2016

Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

Authorprofiling:morelinguisticsandexplanation

BenVerhoevenInclosecollaborationwithWalterDaelemans

PresentedasaJOTALectureLjubljana,Slovenia22November2016

Page 2: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

Text miningThreelayers ofinformationintext

• Objective– Facts,concepts,characteristics ofconcepts,relationsbetween

concepts,...– Who doeswhat,where,how and why?

• Subjective– Opinion,sentiment,emotion,...– Who believes what about what?

• Metadata- Profile– Age,gender,region,...– What doweknow about the author?

Page 3: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

Example

• Objective– Who?Damjan Popič– Did what?Presented atTEDxKranj onthe Janes project

• Subjective– What?Thepresentation wassuccesful– Who believes this?Darja Fišer

Page 4: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

Example

• Metadata– Profile– Who?Darja Fišer– Age?Mid-thirties– Gender?Female– Personality?Extravert– Education?Highly educated

Page 5: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

Stylometry

Thequantitativestudyofstylisticcharacteristicsofatext

WritingstyleAcombinationofinvariantandunconsciousdecisionsinlanguageproductiononalllinguisticlevels,uniquelyassociatedwithspecificauthorsorgroupsofauthors

→HumanStylome Hypothesis(VanHalteren etal.2005)

Page 6: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

Computationalstylometry

• Authorshipidentification– Attribution- attributetexttooneoflimitedsetofauthors– Verification- isunknowntextwrittenbygivenauthor?

• Authorprofiling– Predictionofsociologicalorpsychologicalcharacteristicsofanauthor

Page 7: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

Textcategorization

• Classrepresentation• Documentrepresentation(features)• Supervisedmachinelearningmethod

Page 8: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

Classrepresentation

Authorprofiling• Age(e.g.10s,20s,30s,40s,…)• Gender(e.g.malevs.female)• Location• Personality• Education• Ideology• Mentalhealth

Page 9: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

BriefcatalogueoffeaturesNumeric• Complexity,readability• Vocabularyrichness

– Type-tokenratio– Hapax legomena

• Averagesordistributionsof– Syllablelength– Wordlength– Sentencelength

Character-level• Letterfrequency• Punctuation• Spellingerrors• Charactern-grams

Page 10: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

BriefcatalogueoffeaturesWord-level• Wordn-grams• Specialdictionaries• Morphology:prefixesandsuffixes

Syntax• Part-of-speechdistributions• Frequenciesofsyntacticchunks(e.g.NP=Det +Adj +N)

Page 11: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

Whichdocuments?

Datawithassociatedclassesneededtotrainaclassifier.

Page 12: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

Notthatmanyexistingresources(especiallyforDutch)

Issues• Authorialprofilecanbehardtoget• Notallfreelyavailable– Non-disclosureagreements– Anonymization problems

• Nonehavemorethan2kindsofmeta-data

Page 13: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

Whydowewantallmeta-data?

• Allaspectshaveaninfluenceontheauthor’swritingstyle

• Moreimportantly:theseaspectsarereflectedinthesamekindoffeatures– E.g.pronouns(Pennebaker,2011)

• Solutions:– controlforsomeaspects– balancethedata– takeallaspectsintoaccount

Page 14: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

Someresourcesforpersonality• Essaysdataset(Pennebaker,laterMairesse)– Englishstream-of-consciousnesstextsbystudents

• myPersonality (Stillwell&Kosinski)– Large-scaledatacollectionthroughFacebookapp,manylanguages

• Personae(Luyckx &Daelemans)– Dutchessays,writtenbystudents

• CSICorpus(Verhoeven&Daelemans)– Dutchpapers,essaysandreviewswrittenbystudents

• TwiSty Corpus(Verhoeven,Daelemans &Plank)– MultilingualTwitterstylometry corpus

Page 15: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

CLiPS Stylometry Investigation(CSI)

• Corpusintwogenres:essaysandreviews• Largeamountofmeta-data• Multitudeofpurposes–Mostlyincomputationalstylometry

• Freelyavailable• Yearlyexpansion

Page 16: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

CSICorpus

Authormeta-data• Age• Gender:male/female• Sexualorientation*:straightorLGBT• Regionoforigin:BelgianprovincesorTheNetherlands

• Personalityprofile:BigFiveandMBTI*

*Providedoptionally

Page 17: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

PersonalitytypologiesBigFive

– Opennesstoexperience

– Conscientiousness– Extraversion– Agreeableness– Neuroticity

Score0-100pertrait

MBTI(Myers-BriggsTypeIndicator)

– Extravert– Introvert– Thinking– Feeling– Sensing– iNtuition– Judging– Perceiving

Dichotomywithscore0-100

Page 18: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

CSICorpus

Documentmeta-data• Genre– Essays,papers:writtenforDutchproficiencycourse(universitylevel),formaltext

– Reviews:specialassignment• Documentinfo– Topic,sentiment,veracity(true/false)ofreviews

– Gradesofpapersandessays

Page 19: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

CSICorpus

Corpussize

Genres #docs #tokens Avg.length Std.dev.

Reviews 1298 202,827 156 65

Essays 517 565,885 1095 734

Total 1815 768,712

Page 20: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

CSICorpus

Advantages- Multiplepurposes- Yearlyexpansion- Textfromsimilarsources(withineachgenre)- Enablescross-genreexperiments

Disadvantages- Opportunisticnature(restrictedtoauthorsat

hand)influencesbalanceofmeta-data

Page 21: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

TwitterStylometry (TwiSty)

TwiSty Corpus– Large-scalemultilingualTwittercorpusforpersonalityandgender

– AllWesternEuropeanlanguagesintop20ofTwitterfrequencies,apartfromEnglish• IT,NL,DE,ES,PT,FR

Page 22: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

TwiSty Corpus

• DevelopedonideaofPlank&Hovy (2015)– Twitterminingforonlyoneweek– SearchforMBTItypesviaAPI– OnlyEnglish– Annotatinggender– Result• 1500authors• 1.2Mtweets

Page 23: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

Refresher:MBTI

• Myers-BriggsTypeIndicator– Extraversionvs.Introversion– iNtuitive vs.Sensing– Thinkingvs.Feeling– Judgingvs.Perceiving

• 16Types– E.g.ESTJ,ISFP,ENTP,…

Page 24: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

TwiSty CorpusDatacollection• TwittersearchinsteadofminingthroughAPI• SearchforcombinationofeachMBTItypewithlanguage-

specificwords• DownloadHTML

Italian che, sono,fatto

Dutch ik,jij,het,persoonlijkheid

German ich,bist,Persönlickheit,dass

French suis, c’est,personnalité

Spanish soy, tengo,personalidad

Portuguese sou, personalidade

Page 25: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

TwiSty CorpusDataclean-up• Filterouttweetsthatwerenotrelevant:– Notaboutauthor

• @schrooten ok,ik heb deze testdestijds meteen uitgebreidevragenlijst opmijn werk gedaan.Meerdere vanmijn collega PM-ers zijn ESTJ...

– Ambiguityoftype• Volgens mij benik zowel INTJals ESTJ-- heteerste als ik merotvoel,hettweede als hetgoed gaat.#beetjevreemd

– Indifferentlanguage• Estj seregas muzon4ik?Het.O,nutaddavaj daj timati,etoj djdljee.;D

• Labelforgender

Page 26: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

TwiSty Corpus

• Corpussizeinprofiles

• Corpussizeintweets

DE IT NL FR PT ES

411 490 1,000 1,405 4,090 10,772

Page 27: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

TwiSty Corpus

LanguageIdentification• Manybilingual/polyglotTwitterusers• Tweet-levelidentification• Majorityvotingapproachwiththreelanguageidentifiers

Tool Authors #Langs

langid.py Lui &Baldwin (2012) 97

langdetect Nakatani (2010) 53

ldig Nakatani (2012) 17

Page 28: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

TwiSty Corpus

• Corpussizeintweets

Page 29: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

Experiment

• Instances:200tweetsperuser• Preprocessing:normalizeurls,hashtags,mentionsandtokenize

• Features:characterandwordn-grams• Model:LinearSVC• Evaluation:10-foldcross-validation

Page 30: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

Genderprediction

Language WRB MAJ F-score

DE 50.28 53.75 77.62

IT 54.78 65.46 73.29

NL 50.04 51.41 82.61

FR 51.84 59.60 83.80

PT 52.15 60.36 87.55

ES 51.00 57.06 87.62

Page 31: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

PersonalitypredictionLang Trait WRB MAJ F-score

DE I-E 60.22 72.61 72.27

S-N 71.03 82.43 74.49

T-F 51.16 57.62 59.03

J-P 53.68 63.57 61.99

IT I-E 65.54 77.88 77.78

S-N 75.60 85.78 79.21

T-F 50.31 53.95 52.13

J-P 50.19 53.05 47.01

NL I-E 53.02 62.28 62.90

S-N 57.66 69.57 70.49

T-F 51.47 58.59 59.95

J-P 52.00 60.00 57.99

Page 32: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

PersonalitypredictionLang Trait WRB MAJ F-score

FR I-E 54.77 65.44 66.49

S-N 68.00 80.00 78.90

T-F 50.65 55.68 58.22

J-P 52.13 60.32 56.79

PT I-E 53.36 62.97 66.69

S-N 65.60 76.08 73.42

T-F 51.27 57.98 61.62

J-P 50.87 56.61 56.53

ES I-E 50.00 50.49 61.09

S-N 55.42 66.47 61.54

T-F 51.63 59.04 59.73

J-P 51.53 58.75 56.08

Page 33: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

Conclusion

• Large-scale,“opportunistic”,multilingualsocialmediacorpus

• Genderpredictionworksverywell• Personalitypredictionismoredifficult,yetpossible

Page 34: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

SloveneTwitter

• Currently,workingongenderpredictiononSlovenetweets– TwittercorpusfromJanesproject• 6,500authors• 2/3male,1/3female

Page 35: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

Explanation• Lack ofeffortinstylometry to explain results,despite some great early examples

• Argamon &Koppel(2003)– Use ofpronouns (moreby women)and certain typesofnoun modification (moreby men)• ‘Male’words:a,the,that,these,one,two,more,some• ‘Female’words:I,you,she,her,their,myself,yourself,herself

– More‘relational’language by women,more‘informative/rational’language by men• Eveninformal language (non-fiction)

Page 36: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

Morelinguistics

• Discourse

• Semantics

Page 37: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

Discourse

• What– relationsbetweensentences– coherentstructure– situatingtextintheworld

• How– discourserelationaldevices(DRD)

Page 38: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

Discourse

Features

• Dictionarywithcategoriesfordifferentkindsofdiscoursestructure

• Frequenciesofcategoriesareanapproximationoftheiruse

Page 39: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

Discourse

• PennDiscourseTreebanktagset(PDTBResearchGroup,2007)

– TEMPORAL• Synchronous:terwijl• Asynchronous:alvorens,nadat

– CONTINGENCY• Cause:dankzij,want• Condition:aangezien,als

Page 40: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

Discourse• PennDiscourseTreebanktagset

(PDTBResearchGroup,2007)– COMPARISON

• Contrast:oftewel• Concession:ofschoon,wanneer

– EXPANSION• Conjunction:alsook,eveneens• Instantiation:zoals• Restatement:alsof• Alternative:noch,hetzij• Exception:uitgezonderd• List:en

Page 41: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

Ambiguity

• Nothingmuchchangedwhile/TIME Iwasaway.

• While/CONCESSION Iwouldn’trecommendanight-timevisit,bydaytheareaislovely.

• Onepersonwantsout,while/CONTRAST theotherwantstherelationshiptocontinue.

ÞWeighting

Page 42: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

DictionaryCreation

• PennDiscourseTreebank(PDTB):textwithannotateddiscourseconnectives–Makedictionaryofconnectiveswithweightedclasses

• Extrapolatedthisdictionarytootherlanguages– UsingmultilinguallexicaofdiscoursemarkerscreatedfromalignedEuroparl corpora

Page 43: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

DictionaryCreation

• 86EnglishseedwordsfromPDTB• Numberoftranslationsfound– Dutch:335– German:341– Slovene:299

• Onaverage2categoriesperconnective• Meanstrengthofstrongestcategory:90%

Page 44: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

Ongoingresearch

• EvaluatethisdictionaryonGermanannotatedlexicon:DimLex

• ExperimentsusingdiscoursedictionariesforDutch&Englishgenderclassificationonnewscorpora

Page 45: Author profiling: more linguistics and explanation · 11/22/2016  · Personality typologies Big Five ... • Preprocessing: normalize urls, hashtags, mentions and tokenize • Features:

Hvala za pozornost!

Aliimate vprašanja?

[email protected]@verhoevenben