Corporate Fraud, LDA, and Econometrics DSSG 2019 March 27 Dr. Richard M. Crowley SMU Slides: [email protected] @prof_rmc rmc.link/DSSG

Page 1: Corporate Fraud, LDA, and Econometrics

Corporate Fraud, LDA, and Econometrics

DSSG ⋅ 2019 March 27

Dr. Richard M. Crowley
SMU

Slides: rmc.link/DSSG
[email protected] @prof_rmc

Page 2: Corporate Fraud, LDA, and Econometrics

The problem

How can we detect if a firm is currently involved in a major instance of misreporting?

▪ Detect: Classification problem
▪ Currently: Prediction problem
▪ Misreporting: The accounting side
▪ The approach combines…
  ▪ Business insight
  ▪ Economic theory
  ▪ Psychology theory
  ▪ Statistics
  ▪ Machine learning
  ▪ Careful econometrics

Page 3: Corporate Fraud, LDA, and Econometrics

Why do we care?

The 10 most expensive US corporate frauds cost shareholders 12.85B USD

▪ The above, based on Audit Analytics, ignores:
  ▪ GDP impacts: Enron's collapse cost ~35B USD
  ▪ Societal costs: Lost jobs, economic confidence
  ▪ Any negative externalities, e.g. compliance costs
  ▪ Inflation: In current dollars it is even higher

Catching even 1 more of these as they happen could save billions of dollars

Page 4: Corporate Fraud, LDA, and Econometrics

What is Misreporting?

Page 5: Corporate Fraud, LDA, and Econometrics

Misreporting: A simple definition

Errors that affect firms' accounting statements or disclosures which were done seemingly intentionally by management or other employees at the firm.

Page 6: Corporate Fraud, LDA, and Econometrics

Traditional accounting fraud

1. A company is underperforming
2. Management cooks up some scheme to increase earnings
  ▪ Wells Fargo (2011-2018?)
  ▪ Fake/duplicate customers and transactions
3. Create accounting statements using the fake information

Page 7: Corporate Fraud, LDA, and Econometrics

Other accounting fraud types

▪ Dell (2002-2007): Cookie jar reserve (secret payments by Intel of up to 76% of quarterly income)
  1. The company is overperforming
  2. "Save up" excess performance for a rainy day
  3. Recognize revenue/earnings when needed to hit future targets
▪ Apple (2001): Options backdating
▪ China North East Petroleum Holdings Limited: Related party transactions (transferring 59M USD from the firm to family members over 176 transactions)
▪ CVS (2000): Improper accounting treatments (Not using mark-to-market accounting to fair value stuffed animal inventories)
▪ Countryland Wellness Resorts, Inc. (1997-2000): Gold reserves were actually… dirt

Page 8: Corporate Fraud, LDA, and Econometrics

Where are these disclosed? (US)

1. US SEC AAERs: Accounting and Auditing Enforcement Releases
  ▪ Highlight larger/more important cases, written by the SEC
  ▪ Example: The Summary section of this AAER against Sanofi
2. 10-K/A filings ("10-K" ⇒ annual report, "/A" ⇒ amendment)
  ▪ Note: not all 10-K/A filings are caused by fraud!
    ▪ Benign corrections or adjustments can also be filed as a 10-K/A
    ▪ Note: Audit Analytics' write-up on this for 2017
3. By the US government through a 13(b) action
4. In a note inside a 10-K filing
  ▪ These are sometimes referred to as "little r" restatements
5. In a press release, which is later filed with the US SEC as an 8-K
  ▪ 8-Ks are filed for many other reasons too though

Original disclosure motivated by management admission, government investigation, or shareholder lawsuit

Page 9: Corporate Fraud, LDA, and Econometrics

Where are we at?

Fraud happens in many ways, for many reasons

▪ All of them are important to capture
▪ All of them affect accounting numbers differently
▪ None of the individual methods are frequent…

It is disclosed in many places. All have subtly different meanings and implications

▪ We need to be careful here (or check multiple sources)

This is a hard problem!

Page 10: Corporate Fraud, LDA, and Econometrics

Predicting Fraud

Page 11: Corporate Fraud, LDA, and Econometrics

Main question and approaches

How can we detect if a firm is currently involved in a major instance of misreporting?

▪ 1990s: Financials and financial ratios
  ▪ Misreporting firms' financials should be different than expected
▪ Late 2000s/early 2010s: Characteristics of firm disclosures
  ▪ Annual report length, sentiment, word choice, …
▪ Late 2010s: More holistic text-based ML measures of disclosures
  ▪ Modeling what the company discusses in their annual report

All of these are discussed in Brown, Crowley and Elliott (2018); I will refer to the paper as BCE for short

Page 12: Corporate Fraud, LDA, and Econometrics

What we need to address:

1. Detecting varied events
  ▪ "Careful" feature selection (offload to econometrics)
  ▪ Intelligent feature design (partially offload to ML)
2. For business users… Interpretability matters
  ▪ Psychology-style experiment
  ▪ And a quasi-experiment
3. Predictive model
  ▪ Need clean, out of sample designs + backtesting
  ▪ Windowed design: data from 1998 won't help today, but it would in 1999
4. Infrequent events
  ▪ Good for society, bad for modeling
  ▪ Careful econometrics

Page 13: Corporate Fraud, LDA, and Econometrics

Main results

Page 14: Corporate Fraud, LDA, and Econometrics

Issue 1: Varied events

Page 15: Corporate Fraud, LDA, and Econometrics

Past models

Financial model based on Dechow, et al. (2011)

▪ 17 measures including:
  ▪ Log of assets
  ▪ % change in cash sales
  ▪ Indicator for mergers
▪ Theory: Purely economic
  ▪ Misreporting firms' financials should be different than expected
    ▪ Perhaps more income
    ▪ Odd capital structure

Textual style model based on various papers

▪ 20 measures including:
  ▪ Length and repetition
  ▪ Sentiment
  ▪ Grammar and structure
▪ Theory: Communications
  ▪ Style reflects complexity and unintentional biases
  ▪ Some measures ad hoc
▪ Misreporting ⇒ annual report written differently

We tested an additional 26 financial & 60 style variables

Page 16: Corporate Fraud, LDA, and Econometrics

The BCE model

1. Retain the variables from the previous models' regressions
  ▪ Forms a useful baseline
2. Add in an ML measure quantifying how much each annual report (~20-300 pages) talks about different topics
  ▪ Train on windows of the prior 5 years
    ▪ Balance data staleness, data availability, and quantity of text
  ▪ Optimal to have 31 topics per 5 years
    ▪ Based on in-sample logistic regression optimization

Why do we do this? Think like a fraudster!

▪ From communications and psychology:
  ▪ When people are trying to deceive others, what they say is carefully picked: topics chosen are intentional
▪ Putting this in a business context:
  ▪ If you are manipulating inventory, you don't talk about inventory

Page 17: Corporate Fraud, LDA, and Econometrics

What the topics look like

Page 18: Corporate Fraud, LDA, and Econometrics

How to do this: LDA

▪ LDA: Latent Dirichlet Allocation
  ▪ Widely-used in linguistics and information retrieval
  ▪ Available in C, C++, Python, Mathematica, Java, R, Hadoop, Spark, …
    ▪ We used onlineldavb
    ▪ Gensim is great for python; STM is great for R
  ▪ Used by Google and Bing to optimize internet searches
  ▪ Used by Twitter and NYT for recommendations
▪ LDA reads documents all on its own! You just have to tell it how many topics to find
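To make "you just have to tell it how many topics to find" concrete, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling in plain Python. It is a teaching toy, not the online variational Bayes method in onlineldavb that the study used, and the function name `lda_gibbs`, the hyperparameters, and the four tiny documents are all assumptions made for illustration.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not onlineldavb)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]            # tokens per (doc, topic)
    topic_word = [[0] * n_words for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics
    assign = []
    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        assign.append([])
        for w in doc:
            z = rng.randrange(n_topics)
            assign[d].append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                v, z = word_id[w], assign[d][n]
                # Remove the token, then resample its topic given all other tokens.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                weights = [(doc_topic[d][k] + alpha) * (topic_word[k][v] + beta)
                           / (topic_total[k] + n_words * beta)
                           for k in range(n_topics)]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][n] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1
    # Smoothed share of each document devoted to each topic (rows sum to 1).
    shares = [[(c + alpha) / (len(doc) + n_topics * alpha) for c in row]
              for row, doc in zip(doc_topic, docs)]
    top_words = [[vocab[j] for j in sorted(range(n_words),
                                           key=lambda j: -topic_word[k][j])[:3]]
                 for k in range(n_topics)]
    return shares, top_words

# Hypothetical, already-tokenized "documents".
docs = [
    ["gold", "mining", "ore", "gold", "mine"],
    ["loan", "bank", "deposit", "loan", "credit"],
    ["mining", "ore", "gold", "mine", "ore"],
    ["bank", "credit", "deposit", "loan", "bank"],
]
shares, top_words = lda_gibbs(docs, n_topics=2)
```

The only modeling input is `n_topics`; everything else is read off the text. For real annual reports a library such as Gensim is the practical choice.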

Page 19: Corporate Fraud, LDA, and Econometrics

Implementation details

1. Annual reports are a mess
  ▪ Fixed width text files; proper html; html exported from MS Word…
  ▪ Embedded hex images
  ▪ Solution: Regexes, regexes, regexes
    ▪ Detailed in the paper's web appendix
2. Stemming, tokenizing, stopwords
3. Feed to LDA
4. Tune hyperparameters (# of topics is most crucial)
5. Finally implement the model

The usual adage that data cleaning takes the longest still holds true
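Steps 1 and 2 above can be sketched roughly as follows. This is only an illustration of the kind of regex cleaning and token filtering involved: the stop-word list is a tiny made-up sample, `crude_stem` is a deliberately naive suffix stripper rather than a real stemmer such as Porter's, and the actual regexes used in the study are the ones in the paper's web appendix.

```python
import re

# Tiny illustrative stop-word list (an assumption, not the study's list).
STOP_WORDS = {"the", "of", "and", "a", "to", "in", "is", "for", "our", "we"}

def strip_markup(raw):
    """Remove HTML tags and entities, then collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)            # HTML tags
    text = re.sub(r"&[a-z]+;|&#\d+;", " ", text)   # HTML entities such as &nbsp;
    return re.sub(r"\s+", " ", text).strip()

def crude_stem(token):
    """Naive suffix stripping; a stand-in for a proper stemmer."""
    for suffix in ("ing", "ies", "es", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(raw):
    """Clean, tokenize, drop stop words, and stem one document."""
    tokens = re.findall(r"[a-z]+", strip_markup(raw).lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

tokens = preprocess("<p>The Company is <b>increasing</b> inventories&nbsp;reserves.</p>")
```

The resulting token lists are what would be fed to LDA in step 3.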

Page 20: Corporate Fraud, LDA, and Econometrics

Other considerations

1. LDA provides the weight on each topic, but documents vary a lot by length
  ▪ Solution: Normalize to a percentage between 0 and 1
2. There is a mechanical component to topics due to firms' industries
  ▪ Solution: Orthogonalize topics to industry
    ▪ Run a linear regression and retain $\varepsilon_{i,\text{firm}}$:

$$\text{topic}_{i,\text{firm}} = \alpha + \sum_{j} \beta_{i,j}\,\text{Industry}_{j,\text{firm}} + \varepsilon_{i,\text{firm}}$$
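Both adjustments can be sketched in a few lines. The sketch assumes the industry regressors are plain indicator (dummy) variables, in which case the OLS residual from an intercept plus industry dummies equals the topic share minus its industry mean; the function names and the four-firm example are hypothetical.

```python
from collections import defaultdict

def normalize(weights):
    """Turn raw topic weights for one document into shares that sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

def orthogonalize_to_industry(topic_share, industry):
    """Residuals from regressing one topic's share on industry dummies.

    With an intercept and indicator variables only, the fitted value for
    each firm is its industry mean, so the residual is share - mean.
    """
    groups = defaultdict(list)
    for share, ind in zip(topic_share, industry):
        groups[ind].append(share)
    means = {ind: sum(vals) / len(vals) for ind, vals in groups.items()}
    return [share - means[ind] for share, ind in zip(topic_share, industry)]

doc_shares = normalize([2.0, 6.0, 12.0])          # raw weights for one document
residuals = orthogonalize_to_industry(
    [0.10, 0.30, 0.20, 0.40],                      # one topic's share, four firms
    ["bank", "bank", "mining", "mining"],          # each firm's industry
)
```

The residuals capture how much more or less a firm discusses a topic than its industry peers, which is the part that is not mechanical.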

Page 21: Corporate Fraud, LDA, and Econometrics

Issue 2: Interpretability

Page 22: Corporate Fraud, LDA, and Econometrics

LDA Verification

▪ LDA is well validated on general text, no question
  ▪ One key is to present some details of the topics to ensure comfort
  ▪ Another key is having prior evidence to fall back on
▪ Whether LDA works on business-specific documents is not so well studied
  ▪ Most studies just ask people whether they agree with the hand-coded topic categorizations

We decided to fill this gap

Page 23: Corporate Fraud, LDA, and Econometrics

Experimental design

Instrument: A word intrusion task

▪ Which word doesn't belong?
  1. Commodity, Bank, Gold, Mining
  2. Aircraft, Pharmaceutical, Drug, Manufacturing
  3. Collateral, Iowa, Residential, Adjustable

Participants

▪ 100 individuals on Amazon Mechanical Turk (20 questions each)
  ▪ Human but not specialized

Page 24: Corporate Fraud, LDA, and Econometrics

Quasi-experimental design

▪ 3 Computer algorithms (>10M questions each)
  ▪ Not human but specialized

These learn the "meaning" of words in a given context

1. GloVe on general website content
  ▪ Less specific but broader
2. Word2vec trained on Wall Street Journal articles
  ▪ More specific, business oriented
3. Word2vec directly on annual reports
  ▪ Most specific

Run the exact same experiment as on humans
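One plausible way for an embedding model to answer a word intrusion question is to pick the word that is least similar, on average, to the others. The sketch below assumes that rule and uses made-up three-dimensional vectors; real GloVe or word2vec vectors have hundreds of dimensions, and the paper may score intruders differently.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def pick_intruder(words, vectors):
    """Return the word with the lowest average similarity to the other words."""
    def avg_similarity(word):
        others = [o for o in words if o != word]
        return sum(cosine(vectors[word], vectors[o]) for o in others) / len(others)
    return min(words, key=avg_similarity)

# Hypothetical toy embeddings, invented for this example.
toy_vectors = {
    "commodity": [0.90, 0.10, 0.00],
    "gold":      [0.80, 0.20, 0.10],
    "mining":    [0.85, 0.15, 0.05],
    "bank":      [0.10, 0.90, 0.20],
}
intruder = pick_intruder(["commodity", "bank", "gold", "mining"], toy_vectors)
```

Scoring the algorithm is then the same as scoring a human: the share of questions where the chosen word matches the planted intruder.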

Page 25: Corporate Fraud, LDA, and Econometrics

Experimental results

[Figure: Validation of LDA measure (Intrusion task). Y-axis: % of questions correct (ticks 10 to 70). X-axis: Data source (Experiment, Internet, WSJ, Filings). Series: Maximum accuracy, Average accuracy, Minimum accuracy, Random chance.]

Page 26: Corporate Fraud, LDA, and Econometrics

Issue 3: Predictive modeling

Page 27: Corporate Fraud, LDA, and Econometrics

Backtesting

We don't know who is misreporting today

▪ So, we will backtest
  ▪ Use historical data to validate our model
▪ Problems:
  1. Misreporting changes over time
  2. Misreporting is unobservable (until it's observable)

Page 28: Corporate Fraud, LDA, and Econometrics

Moving target

▪ Implement a moving window approach
  ▪ 5 years for training + 1 year for testing
  ▪ The study uses data from 1994 through 2012 (14 possible windows)
  ▪ Ex.: to predict misreporting in 2010, train on data from 2005 to 2009

Problem: Now we have 14 models…
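The window arithmetic is easy to check with a short sketch. The helper name `rolling_windows` is hypothetical; the years and the 5 + 1 split come from the slide.

```python
def rolling_windows(first_year, last_year, train_years=5):
    """List (training_years, test_year) pairs for a moving-window backtest."""
    windows = []
    for test_year in range(first_year + train_years, last_year + 1):
        train = list(range(test_year - train_years, test_year))
        windows.append((train, test_year))
    return windows

windows = rolling_windows(1994, 2012)
# The window that predicts 2010 trains on 2005 through 2009.
train_2010 = next(train for train, test in windows if test == 2010)
```

Each pair yields its own fitted model, which is where the 14 models come from.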

Page 29: Corporate Fraud, LDA, and Econometrics

Comparing multiple models

▪ Performance measures:
  1. ROC AUC
  2. Fisher statistics
  3. Performance at a reasonable cutoff (5%)
  4. NDCG@k (usually used in ranking problems)

ROC AUC and Fisher statistics will also allow us to statistically compare across models
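Since NDCG@k is less familiar outside ranking problems, here is a minimal sketch using the common binary-relevance form with a log2 discount. The exact variant and the choice of k used in BCE are not stated here, so treat both as assumptions.

```python
import math

def dcg_at_k(labels_in_rank_order, k):
    """Discounted cumulative gain of the top-k items (rank 1 has discount log2(2) = 1)."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(labels_in_rank_order[:k]))

def ndcg_at_k(scores, labels, k):
    """DCG of the model's ranking divided by the DCG of a perfect ranking."""
    ranked = [lab for _, lab in sorted(zip(scores, labels), key=lambda t: -t[0])]
    best = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(ranked, k) / best if best > 0 else 0.0

# Hypothetical fraud scores and true labels (1 = misreporting) for four firms.
value = ndcg_at_k([0.9, 0.8, 0.1, 0.2], [1, 0, 0, 1], k=2)
```

The measure rewards placing true misreporting cases near the top of the list, which matches how an investigator with limited time would use the scores.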

Page 30: Corporate Fraud, LDA, and Econometrics

ROC AUC for windowed approaches

▪ ROC AUC
  ▪ What is the probability that a randomly selected 1 is ranked higher than a randomly selected 0
  ▪ A good score is above 0.70
▪ Aggregating:
  ▪ Simple: average AUC
  ▪ More useful: Pool predictions together (with clustering by year)
▪ Comparing ROC AUCs
  ▪ Not simple…
  ▪ Wald statistic with bootstrapped variance estimates clustered by year
  ▪ Implemented in Stata as rocreg
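The probability interpretation of ROC AUC translates directly into code, and a small made-up example shows why averaging per-window AUCs and pooling predictions can disagree. This sketch covers point estimates only; the clustered bootstrap variance and the Wald test are left to tools such as rocreg.

```python
def roc_auc(scores, labels):
    """P(random 1 is scored above random 0), counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Two hypothetical test windows: (scores, labels).
window_a = ([0.9, 0.2, 0.1], [1, 0, 0])
window_b = ([0.3, 0.6, 0.5, 0.4], [1, 0, 0, 1])

average_auc = (roc_auc(*window_a) + roc_auc(*window_b)) / 2
pooled_auc = roc_auc(window_a[0] + window_b[0], window_a[1] + window_b[1])
```

Pooling lets each prediction count equally and compares scores across years, whereas the simple average weights every window the same regardless of how many cases it contains.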

Page 31: Corporate Fraud, LDA, and Econometrics

Purely statistical method

▪ Fisher statistic (Fisher 1932)
  ▪ Combining p-values (Note: $p \sim U[0, 1]$)
  ▪ p-values come from our out-of-sample prediction model
  ▪ Calculated as: $X = -2 \sum_{i=1}^{k} \ln(p_i)$
▪ Comparing models: Variance-Gamma test (see BCE)
  ▪ Key insight: the difference of $\chi^2$ variables has the same MGF as the Variance Gamma distribution
  ▪ Calculation below
  ▪ $K$ is the modified Bessel function of the second kind

$$P(X_1 > X_2) = \int_{-\infty}^{X_1 - X_2} \frac{1}{2^{k} \sqrt{\pi}\, \Gamma(k)} \, |z|^{k - \frac{1}{2}} \, K_{k - \frac{1}{2}}(|z|) \, dz$$
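The Fisher statistic itself is simple to compute. The sketch below combines a list of p-values and, using the standard result that the statistic follows a chi-squared distribution with 2k degrees of freedom when the p-values are independent and uniform, converts it to a combined p-value with the closed form available for even degrees of freedom. It does not implement the Variance-Gamma comparison, and the function names are hypothetical.

```python
import math

def fisher_statistic(p_values):
    """X = -2 * sum(ln p_i) over the k p-values."""
    return -2.0 * sum(math.log(p) for p in p_values)

def chi2_sf_even(x, k):
    """P(chi-squared with 2k degrees of freedom > x), closed form for even dof."""
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= half / j        # builds (x/2)^j / j!
        total += term
    return math.exp(-half) * total

p_values = [0.05, 0.05]                       # hypothetical inputs
x = fisher_statistic(p_values)
combined_p = chi2_sf_even(x, k=len(p_values))
```

With a single p-value the procedure returns that p-value unchanged, which is a quick sanity check on the formula.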

Page 32: Corporate Fraud, LDA, and Econometrics

Observability

▪ The other issue is that, as of a given year, say 2009, we do not know every firm that was misreporting
  ▪ We could build an algorithm with perfect information, but it may fall flat on current, noisy data!
  ▪ It could also give us a false impression of an algorithm's effectiveness when backtesting
▪ Misreporting can take a long time to discover: Zale's started in 2004, finished in 2009, and was disclosed in 2011!

Solution: Censor our data to what was known at the point in time

▪ Use data on when a misreporting case was first disclosed
  ▪ If the fraud wasn't known by the end of the window, train as if that was 0 (as it was unobservable back then)
▪ Mimics our current situation
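The censoring rule can be written as a one-line label function. The function name and argument layout are hypothetical; the Zale dates are the ones quoted above.

```python
def censored_label(misreport_years, disclosure_year, firm_year, window_end):
    """Training label as it would have been known at the end of the window.

    A firm-year counts as misreporting (1) only if it was actually misreported
    AND the case had already been disclosed by `window_end`; otherwise 0.
    """
    known = disclosure_year is not None and disclosure_year <= window_end
    return int(firm_year in misreport_years and known)

zale_years = range(2004, 2010)   # misreporting from 2004 through 2009
label_in_2009_window = censored_label(zale_years, 2011, firm_year=2008, window_end=2009)
label_in_2011_window = censored_label(zale_years, 2011, firm_year=2008, window_end=2011)
```

The same firm-year therefore enters early training windows as a 0 and later ones as a 1, just as it would have appeared to a modeler at the time.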

Page 33: Corporate Fraud, LDA, and Econometrics

Issue 4: Infrequent events

Page 34: Corporate Fraud, LDA, and Econometrics

Dealing with infrequent events

▪ Fraud is infrequent
  ▪ E.g.: Out of 38,311 firm-years of data, there are 505 firm-years subject to AAERs
▪ Key issue: We may have more variables than events in a window…
  ▪ Even if we don't, convergence is iffy using a logistic model
▪ A few ways to handle this:
  1. Very careful model selection (keep it sufficiently simple)
  2. Sophisticated degenerate variable identification criterion + simulation to implement complex models that are just barely simple enough
    ▪ The main method in BCE
  3. Automated methodologies for paring down models (LASSO, XGBoost)

Page 35: Corporate Fraud, LDA, and Econometrics

Degenerate variable identification

1. Toss every input into a model
2. Check linear independence using a QR decomposition
  ▪ This will let us determine an order for dropping inputs
  ▪ $A = Q \times R$, where $A$ is our feature matrix, $Q$ is an orthogonal matrix, and $R$ is the (upper triangular) transformation
  ▪ More weight on the diagonal element in $R$ means more independent (effectively)
  ▪ Same underlying method as a Gram-Schmidt process
3. Remove excess inputs if too few 1s
  ▪ Why? Because logit can't converge if there are more inputs than events (or non-events) in the data

Linear independence is a useful criterion for removing features with lower likelihood of being useful
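Because QR and Gram-Schmidt share the same underlying idea, the diagonal of R can be illustrated with a short Gram-Schmidt pass in plain Python. This is a sketch under assumptions: it processes columns in the order given, without the pivoting or column scaling a production routine might add, and the function names and the three toy columns are invented for the example.

```python
import math

def r_diagonal(columns):
    """|R_jj| for each column: the norm left after removing earlier columns."""
    basis, diag = [], []
    for col in columns:
        resid = list(col)
        for q in basis:
            coef = sum(r * qi for r, qi in zip(resid, q))
            resid = [r - coef * qi for r, qi in zip(resid, q)]
        norm = math.sqrt(sum(r * r for r in resid))
        diag.append(norm)
        if norm > 1e-12:                      # skip exactly dependent columns
            basis.append([r / norm for r in resid])
    return diag

def drop_order(columns):
    """Column indices sorted from least to most independent."""
    diag = r_diagonal(columns)
    return sorted(range(len(columns)), key=lambda j: diag[j])

features = [
    [1.0, 1.0, 1.0, 1.0],     # intercept
    [1.0, 2.0, 3.0, 4.0],     # a trend
    [2.0, 4.0, 6.0, 8.01],    # almost exactly twice the trend
]
diag = r_diagonal(features)
order = drop_order(features)
```

The third column leaves almost nothing once the first two are accounted for, so it is the first candidate to drop. Since the diagonal depends on each column's scale, standardizing features beforehand would be a sensible precaution.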

Page 36: Corporate Fraud, LDA, and Econometrics

Logistic iteration

1. Run a logit using a Newton-Raphson solver for 50 iterations
2. Check convergence for signs of quasi-completeness
  ▪ Standard errors will be in the millions if quasi-complete
  ▪ If quasi-complete, drop the next least independent variable and restart
3. Run a 500 iteration logit using a Newton-Raphson solver
4. Recheck convergence
  ▪ If failed, drop the next least independent variable and restart

We will essentially get the most complex feasible model with the most independent set of features
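The loop above can be sketched with NumPy as follows. This is a rough illustration rather than the BCE implementation: the convergence tolerance, the clipping of the linear predictor, the 1e6 standard-error threshold standing in for "in the millions", and the function names are all assumptions, and the toy data are built so that `x1` perfectly separates the outcome.

```python
import numpy as np

def newton_logit(X, y, max_iter, tol=1e-8):
    """Plain Newton-Raphson logit; returns (beta, std_errors, converged)."""
    beta = np.zeros(X.shape[1])
    converged = False
    try:
        for _ in range(max_iter):
            p = 1.0 / (1.0 + np.exp(-np.clip(X @ beta, -30, 30)))
            hessian = X.T @ (X * (p * (1 - p))[:, None])
            step = np.linalg.solve(hessian, X.T @ (y - p))
            beta = beta + step
            if np.max(np.abs(step)) < tol:
                converged = True
                break
        # Standard errors from the inverse Hessian at the final estimate.
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ beta, -30, 30)))
        se = np.sqrt(np.diag(np.linalg.inv(X.T @ (X * (p * (1 - p))[:, None]))))
    except np.linalg.LinAlgError:
        return beta, np.full(X.shape[1], np.inf), False
    return beta, se, converged

def fit_most_complex(X, y, names, drop_order, se_limit=1e6):
    """Drop variables in `drop_order` until the logit fits cleanly."""
    keep, queue = list(names), list(drop_order)
    while True:
        cols = [names.index(n) for n in keep]
        _, se, _ = newton_logit(X[:, cols], y, max_iter=50)      # short run
        if np.all(se < se_limit):                                  # not quasi-complete
            beta, _, ok = newton_logit(X[:, cols], y, max_iter=500)  # long run
            if ok:
                return keep, beta
        if not queue:
            raise RuntimeError("no feasible model left")
        keep.remove(queue.pop(0))                                  # drop and restart

y = np.array([0, 0, 1, 0, 1, 1, 0, 1], dtype=float)
x2 = np.arange(1.0, 9.0)
x1 = np.array([-1.0, -2.0, 1.0, -1.5, 2.0, 1.0, -1.0, 3.0])   # separates y perfectly
X = np.column_stack([np.ones(8), x2, x1])
keep, beta = fit_most_complex(X, y, ["const", "x2", "x1"], drop_order=["x1"])
```

In practice `drop_order` would come from the independence ranking of the previous step, so the least independent inputs are sacrificed first.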

Page 37: Corporate Fraud, LDA, and Econometrics

Final comments

Page 38: Corporate Fraud, LDA, and Econometrics

Some other interesting results

Page 39: Corporate Fraud, LDA, and Econometrics

Ways to build on this model

1. Use a better tokenizer such as spaCy
  ▪ Our tokenizer didn't detect noun phrases
2. Use econometric methods that are better suited for sparsity
  ▪ E.g.: XGBoost
3. Consider using a more powerful LDA variant such as supervised LDA (sLDA)
4. No need to stop at LDA: there have been a lot of advancements in NLP since 2003

Final note: The motivation behind our work was not to build a better mousetrap, but to illustrate the usefulness of documents' content to better understand company/manager behavior

Page 40: Corporate Fraud, LDA, and Econometrics

End matter

Page 41: Corporate Fraud, LDA, and Econometrics

Thanks!

Dr. Richard M. Crowley
SMU

Web: rmc.link
[email protected] @prof_rmc

To learn more:

▪ These slides publicly available at rmc.link/DSSG
  ▪ Plenty of links to click through and explore
▪ Technical details publicly available at SSRN

Page 42: Corporate Fraud, LDA, and Econometrics

Case studies

▪ Prediction scores for 1999 ranked in the 98th percentile
▪ First publicized in 2001
▪ Increases in Income topic and firm size are the biggest red flags

▪ Prediction scores for 2004 through 2009 rank 97th percentile or higher each year
▪ AAER published in 2011
▪ Media and Digital Services topics are the red flags

Page 43: Corporate Fraud, LDA, and Econometrics

Financial model

Based on Dechow, Ge, Larson and Sloan (2011)

▪ Log of assets
▪ Total accruals
▪ % change in A/R
▪ % change in inventory
▪ % soft assets
▪ % change in sales from cash
▪ % change in ROA
▪ Indicator for stock/bond issuance
▪ Indicator for operating leases
▪ BV equity / MV equity
▪ Lag of stock return minus value weighted market return
▪ Below are BCE's additions
  ▪ Indicator for mergers
  ▪ Indicator for Big N auditor
  ▪ Indicator for medium size auditor
  ▪ Total financing raised
  ▪ Net amount of new capital raised
  ▪ Indicator for restructuring

Page 44: Corporate Fraud, LDA, and Econometrics

Style model (late 2000s/early 2010s)

From a variety of research papers

▪ Log of # of bullet points + 1
▪ # of characters in file header
▪ # of excess newlines
▪ Amount of html tags
▪ Length of cleaned file, characters
▪ Mean sentence length, words
▪ S.D. of word length
▪ S.D. of paragraph length (sentences)
▪ Word choice variation
▪ Readability
  ▪ Coleman Liau Index
  ▪ Fog Index
▪ % active voice sentences
▪ % passive voice sentences
▪ # of all cap words
▪ # of "!"
▪ # of "?"