Detecting Deception in Writing Style
Sadia Afroz, Michael Brennan and Rachel Greenstadt. Privacy, Security and Automation Lab
Drexel University
Overview
• Authorship recognition
• Authorship recognition in adversarial environment
• Deception detection
• Experiments on different datasets
Authorship recognition
Who wrote the document?
Authorship recognition
Stylometry:
– An authorship recognition system based solely on writing style.
– Not handwriting
– Only linguistic style: word choice, sentence length, parts-of-speech usage, …
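As a rough illustration of what "linguistic style" features can look like, the sketch below measures three simple quantities from raw text. The function name `stylometric_features`, the regex-based tokenization, and the choice of these three features are assumptions made for illustration; the slides do not specify an implementation.

```python
import re

def stylometric_features(text):
    """Return a few simple style measurements (illustrative only)."""
    # Split into sentences on ., ! or ? (a rough heuristic, not a real tokenizer).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of letters or apostrophes, lowercased.
    words = re.findall(r"[a-z']+", text.lower())
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    return {
        "avg_sentence_length": len(words) / len(sentences),  # words per sentence
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "unique_word_ratio": len(set(words)) / len(words),
    }

features = stylometric_features("The road was dark. The man walked on.")
print(features)
```

Real stylometry systems use far larger feature sets (for example part-of-speech counts); this only shows the general shape of the step.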
Why does it work?
• Everybody has learned language differently
How regular authorship recognition works
[Diagram: Extract features → Machine Learning System]
[Diagram: Document of unknown authorship → Extract features → Machine Learning System → Determine authorship]
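The flow (extract features, train a learning system, then determine authorship of an unknown document) can be sketched as follows. This is a minimal sketch that swaps in a nearest-centroid classifier for simplicity; it is not the classifier used in the experiments, which the slides name as NN and SVM. The function-word list, the author names, and the sample texts are all made up for illustration.

```python
import math
import re

# Illustrative function-word list (an assumption, not from the slides).
FUNCTION_WORDS = ["the", "and", "of", "to", "i"]

def extract_features(text):
    """Relative frequency of each function word, plus average word length."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1  # avoid division by zero on empty input
    freqs = [words.count(w) / total for w in FUNCTION_WORDS]
    avg_len = sum(len(w) for w in words) / total
    return freqs + [avg_len]

def train(samples):
    """samples: list of (author, text). Returns one mean feature vector per author."""
    grouped = {}
    for author, text in samples:
        grouped.setdefault(author, []).append(extract_features(text))
    return {
        author: [sum(col) / len(vecs) for col in zip(*vecs)]
        for author, vecs in grouped.items()
    }

def determine_authorship(model, text):
    """Return the author whose centroid is closest to the unknown document."""
    vec = extract_features(text)
    return min(model, key=lambda author: math.dist(model[author], vec))

# Toy training data with two hypothetical authors.
samples = [
    ("alice", "I went to the shop and I sat by the sea."),
    ("alice", "I like the sun and the wind."),
    ("bob", "Considerable deliberation precedes institutional transformation."),
    ("bob", "Statistical methodologies illuminate authorship attribution problems."),
]
model = train(samples)
guess = determine_authorship(model, "I ran to the park and the lake.")
print(guess)
```

The two stages mirror the diagram: `train` corresponds to building the learning system from known documents, and `determine_authorship` to the path taken by a document of unknown authorship.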
Assumptions
• Writing style is invariant.
– It's like a fingerprint: you can't really change it.
– Authorship recognition can identify you if there are sufficient writing samples and a set of suspects.
Adversarial Attacks
• Imitation or framing attack
– Where one author imitates another author
– Participants were asked to imitate Cormac McCarthy in writing about their day
• Obfuscation attack
– Where an author hides his regular style
– Participants were asked to write about their neighborhood in a different style
M. Brennan and R. Greenstadt. Practical attacks against authorship recognition techniques. In Proceedings of the Twenty-First Conference on Innovative Applications of Artificial Intelligence (IAAI), Pasadena, CA, 2009.
[Figure: Accuracy in detecting authorship of regular documents. x-axis: Number of Authors (5 to 40); y-axis: accuracy (0 to 1). Series: 9-Feature (NN), Synonym-Based, Writeprints Baseline (SVM), Random.]
More than 80% accurate in detecting authorship with 40 authors in regular documents
[Figure: Accuracy in detecting authorship of obfuscated documents. x-axis: Number of Authors (5 to 40); y-axis: accuracy (0 to 1). Series: 9-Feature (NN), Synonym-Based, Writeprints Baseline (SVM), Random.]
Accuracy is less than random chance in obfuscated documents
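For reference, the "random chance" level depends on the number of candidate authors. The sketch below assumes the baseline is a uniform guess among equally likely candidates, which is the usual reading but is not spelled out in the slides; `chance_accuracy` is a hypothetical helper name.

```python
def chance_accuracy(num_authors):
    """Expected accuracy of guessing uniformly at random among the candidates."""
    if num_authors < 1:
        raise ValueError("need at least one candidate author")
    return 1 / num_authors

# Author counts matching the x-axis of the charts (5 to 40 in steps of 5).
for n in range(5, 45, 5):
    print(f"{n:2d} authors: chance = {chance_accuracy(n):.3f}")
```

Under this assumption the baseline is 20% with 5 authors and 2.5% with 40, so falling below chance means the attack actively steers the classifier toward a wrong author.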
[Figure: Accuracy in detecting authorship of imitated documents. x-axis: Number of Authors (5 to 40); y-axis: accuracy (0 to 1). Series: 9-Feature (NN), Synonym-Based, Writeprints Baseline (SVM), Random.]
Accuracy is less than random chance in imitated documents
Can we detect stylistic deception?
Regular
Obfuscated
Imitated
Extended-Brennan-Greenstadt Corpus
• 56 authors
– 12 of the participants are from Drexel University
– The rest are paid workers from Amazon Mechanical Turk
• Three kinds of writing samples
– Regular writings (5000 words)
– Imitated writing
  • A 500-word article describing a day
  • Imitate Cormac McCarthy from 'The Road'
– Obfuscated writing
  • A 500-word article describing neighborhood
  • Hide own writing style
Detecting stylistic deception is possible
[Figure: bar chart with a 0 to 100 vertical axis. Groups: Regular, Imitation, Obfuscation. Series: Writeprint, SVM; Lying-detection, J48; 9-feature set, J48. Data labels in extraction order: 98, 85, 89.5, 95.7, 75.3, 59.9, 94.5, 48, 43.]
[Figure: Feature Changes in Imitation and Obfuscation. Horizontal bar chart, axis from -80 to 100. Series: Imitation, Obfuscation. Features: Average sentence length, Gunning-Fog readability index, Cardinal number, Adjective, Average word length, Average syllables per word, Existential there, Adverb, Unique words, Verb, Short words, Particle, Sentence count, Personal pronoun.]
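To make one of the listed features concrete, the sketch below computes a simplified Gunning-Fog index, 0.4 × (words per sentence + 100 × share of words with three or more syllables), and a relative change between a regular and a deceptive sample. Several things here are assumptions: the vowel-group syllable counter is a crude heuristic, the standard exclusions for complex words (proper nouns, common suffixes) are ignored, and `percent_change` assumes the chart reports change in percent, which the slide does not state.

```python
import re

def count_syllables(word):
    """Rough syllable estimate: number of vowel groups, at least one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def gunning_fog(text):
    """Simplified Gunning-Fog index (no proper-noun or suffix exclusions)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    # "Complex" words are approximated as those with three or more syllables.
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences) + 100 * len(complex_words) / len(words))

def percent_change(regular_value, deceptive_value):
    """Relative change of a feature from regular to deceptive writing, in percent."""
    if regular_value == 0:
        raise ValueError("regular_value must be non-zero")
    return 100 * (deceptive_value - regular_value) / regular_value

# Hypothetical samples, written only for this example.
regular = "I walked to the shop. I bought some bread and went home."
deceptive = "He walked. The road was long. Nothing moved."
change = percent_change(gunning_fog(regular), gunning_fog(deceptive))
print(f"Gunning-Fog change: {change:.1f}%")
```

A negative value means the deceptive sample scores as easier to read than the author's regular writing.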
Problem with the dataset: Topic Similarity
• All the deceptive documents were of the same topic.
• Non-content-specific features have the same effect as content-specific features.
[Figure: Effect of different feature set in detecting adversarial authorship. y-axis: F-measure (0 to 1); x-axis: Different Writing Samples (imitation, regular, obfuscation); series: Syntactic, Lexical, Content.]
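Detection quality for a class such as "imitated" is commonly summarized with an F-measure, the harmonic mean of precision and recall. The sketch below computes it for one class from true and predicted labels. The labels and predictions are invented for illustration and are not results from the corpus; `f_measure` is a hypothetical helper name.

```python
def f_measure(true_labels, predicted_labels, positive):
    """F-measure (harmonic mean of precision and recall) for one class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    if tp == 0:
        return 0.0  # no correct positive predictions: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels for five documents.
truth = ["regular", "imitated", "imitated", "obfuscated", "imitated"]
predicted = ["regular", "imitated", "regular", "imitated", "imitated"]
score = f_measure(truth, predicted, positive="imitated")
print(round(score, 3))
```

Computing the score per class makes it possible to compare how well one feature set separates regular, imitated, and obfuscated writing.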
Hemingway-Faulkner Imitation Corpus
• Articles from the International Imitation Hemingway Contest (2000-2005)
• Articles from the Faux Faulkner Contest (2001-2005)
• Original excerpts of Ernest Hemingway and William Faulkner
Deception detection is possible even when the topic is not similar
• 81.2% accurate in detecting imitated documents.
Long term deception: A Gay Girl In Damascus
– Original author was a 40-year-old American citizen, Thomas MacMaster.
– Pretended to be a Syrian gay woman, Amina Arraf.
– The author worked for at least 5 years to create a new style.
[Images: Thomas MacMaster; fake picture of Amina Arraf.]
Long term deception is hard to detect
• None of the blog posts were found to be deceptive.
• But regular authorship recognition can help.
• We tried to attribute authorship of the blog posts using Thomas (as himself), Thomas (as Amina), Britta (Thomas's wife).
Long term deception: Authorship recognition of the blog posts
• Thomas MacMaster: 54%
• Amina Arraf: 43%
• Britta (Thomas's wife): 3%
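One way a per-candidate breakdown like this could be produced is to classify each blog post separately and report the share of posts assigned to each candidate. That procedure is an assumption; the slides do not describe how the percentages were computed. The labels below are placeholders, not the actual per-post predictions.

```python
from collections import Counter

def attribution_shares(predicted_authors):
    """Percentage of documents attributed to each candidate author."""
    if not predicted_authors:
        raise ValueError("need at least one prediction")
    counts = Counter(predicted_authors)
    total = len(predicted_authors)
    return {author: 100 * n / total for author, n in counts.items()}

# Placeholder predictions for eight posts and three candidates A, B, C.
shares = attribution_shares(["A", "A", "B", "A", "C", "B", "A", "B"])
print(shares)
```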
Future work
• Intrusion detection
• Social spam detection
• Identifying quality discourse
Two Tools
• JStylo: Authorship Recognition Analysis Tool.
• Anonymouth: Authorship Recognition Evasion Tool.
• Free, Open Source. (GNU GPL)
• Alpha releases available today at https://psal.cs.drexel.edu
– Migrating to GitHub soon.
Privacy, Security and Automation Lab (https://psal.cs.drexel.edu)
• Faculty
– Dr. Rachel Greenstadt
• Graduate Students
– Sadia Afroz (Deception Detection Lead)
– Diamond Bishop
– Michael Brennan
– Aylin Caliskan
– Ariel Stolerman (JStylo Lead Developer)
• Undergraduate Students
– Pavan Kantharaju
– Andrew McDonald (Anonymouth Lead Developer)