Natural Language Processing and Textual Analysis in Finance and Accounting
Tim Loughran and Bill McDonald, University of Notre Dame


Page 1: Natural Language Processing and Textual Analysis in Finance and Accounting

Natural Language Processing and Textual Analysis in Finance and Accounting

Tim Loughran and Bill McDonald

University of Notre Dame

Page 2: Natural Language Processing and Textual Analysis in Finance and Accounting

Overview Data/Programs Sample App Stemming Word Lists Resources

… “ ‘Cause you know sometimes words have two meanings.”

Page 3: Natural Language Processing and Textual Analysis in Finance and Accounting

• What do we call this?

– Textual analysis

– Natural language processing

– Sentiment analysis

– Content analysis

– Computational linguistics

Page 4: Natural Language Processing and Textual Analysis in Finance and Accounting

• Increased interest attributable to:

– Bigger, faster computers

– Availability of large quantities of text

– New technologies derived from search engines

Page 5: Natural Language Processing and Textual Analysis in Finance and Accounting

• Examples of data sources:

– EDGAR (1994-2011, 22.7 million filings)

– WSJ News Archive (XML encapsulated, 2000 onward)

– Audio transcripts (e.g., conference calls)

– Web sites

– Google searches

– Twitter / Stocktwits

Page 6: Natural Language Processing and Textual Analysis in Finance and Accounting

• Programs

– Black boxes (Wordstat, Lexalytics, Diction …)

– Two critical components:

    • Ability to download data and convert it into a string/character variable

    • Ability to parse large quantities of text

Page 7: Natural Language Processing and Textual Analysis in Finance and Accounting

• Most modern languages provide for both of these functions:

– Perl

– Python

– SAS Text Miner

– VB.net

Page 8: Natural Language Processing and Textual Analysis in Finance and Accounting

• Parsing large quantities of text: REGEX

• Regular expressions example

– Regex that attempts to identify sentences

(?<=^|[\.!\?]\s+|\n{2,})[A-Z][^\.!\?\n]{20,}(?=([\.!\?](\s|$)))
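The pattern above relies on a variable-width lookbehind, which .NET-style regex engines accept but Python's built-in re module rejects. A minimal Python sketch of the same idea, assuming it is acceptable to consume the boundary and capture the sentence in a group instead (the names SENTENCE_RE and find_sentences are hypothetical):

```python
import re

# Adaptation of the slide's pattern for Python's re module.
# The boundary is matched by a non-capturing group rather than a lookbehind.
SENTENCE_RE = re.compile(
    r"(?:^|[.!?]\s+|\n{2,})"   # start of text, end punctuation + whitespace, or blank line
    r"([A-Z][^.!?\n]{20,})"    # capital letter, then at least 20 non-terminator characters
    r"(?=[.!?](?:\s|$))"       # must be followed by a terminator and whitespace or end of text
)


def find_sentences(text):
    """Return candidate sentences (without their closing punctuation)."""
    return SENTENCE_RE.findall(text)


sample = (
    "The company reported a large loss this year. "
    "Management expects conditions to improve soon. Ok."
)
for sentence in find_sentences(sample):
    print(sentence)
```

Like the original, this is only an attempt: abbreviations such as "e.g." and decimal numbers will still split or suppress sentences, and very short sentences are skipped by the 20-character minimum.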

Page 9: Natural Language Processing and Textual Analysis in Finance and Accounting

• Summary of technical literature:

Natural languages are messy and difficult to parse with computers.

Current Issues in Parsing Technology, Masaru Tomita, Kluwer Academic Publishers, 1991, p. 1

 

Page 10: Natural Language Processing and Textual Analysis in Finance and Accounting

• Tripwires – some examples

– Parsing out 10-K segments

– “May”

– Disambiguation of abbreviations

– Older files are less structured

Page 11: Natural Language Processing and Textual Analysis in Finance and Accounting

• Download 10-X

– Download master files for each year/qtr

"ftp://ftp.sec.gov/edgar/full-index/YYYY/QTR#"

Page 12: Natural Language Processing and Textual Analysis in Finance and Accounting

• Identify target forms from master file

• Download forms

– http://www.sec.gov/Archives/target file name
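The download steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' program: it assumes the master index is pipe-delimited with the columns CIK, Company Name, Form Type, Date Filed, and Filename, and that the files are reachable over HTTPS under www.sec.gov/Archives (the FTP address on the earlier slide reflects access at the time of the talk). The function names and the user-agent argument are hypothetical.

```python
import urllib.request

ARCHIVES = "https://www.sec.gov/Archives/"  # assumed HTTPS base for index and filing paths


def parse_master_index(index_text, target_forms):
    """Return the rows of a master index whose form type is in target_forms."""
    rows = []
    for line in index_text.splitlines():
        parts = line.strip().split("|")
        # Data rows have five pipe-delimited fields; skip header text and the column row.
        if len(parts) != 5 or parts[0] == "CIK":
            continue
        cik, company, form_type, date_filed, filename = parts
        if form_type in target_forms:
            rows.append(
                {"cik": cik, "company": company, "form": form_type,
                 "date": date_filed, "filename": filename}
            )
    return rows


def filing_url(filename):
    """Build the download address from the Filename column of the index."""
    return ARCHIVES + filename


def download_text(url, user_agent):
    """Fetch a file and return it as a string (latin-1 never fails to decode)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        return response.read().decode("latin-1")
```

A typical loop would call download_text on the index for each year and quarter, pass the result to parse_master_index with the set of target form types, and then call download_text(filing_url(row["filename"]), ...) for each row, pausing between requests.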

Page 13: Natural Language Processing and Textual Analysis in Finance and Accounting

• Iterate thru forms:

– Clean up text file

    • Remove ASCII-encoded segments (e.g., graphics, PDFs)

    • Remove XBRL

    • Remove tables (<TABLE>.*?</TABLE>)

    • Remove all remaining markup tags (HTML)

    • Re-encode character entity references (e.g., &amp; = &)
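A minimal Python sketch of these cleanup steps, applied in the order listed. The patterns are assumptions for illustration: ASCII-encoded segments are taken to be uuencoded blocks that start with a "begin NNN name" line and end with an "end" line, and XBRL is taken to sit inside <XBRL> tags. The function name clean_filing is hypothetical.

```python
import html
import re

FLAGS = re.IGNORECASE | re.DOTALL


def clean_filing(raw):
    """Strip encoded binaries, XBRL, tables, and markup from a filing's text."""
    # Assumed form of ASCII-encoded segments: uuencoded blocks ("begin 644 name" ... "end").
    text = re.sub(r"^begin \d{3} .*?^end$", " ", raw, flags=re.DOTALL | re.MULTILINE)
    # Assumed XBRL wrapper tags.
    text = re.sub(r"<XBRL\b[^>]*>.*?</XBRL>", " ", text, flags=FLAGS)
    # Tables, as on the slide: lazy match from the opening tag to the closing tag.
    text = re.sub(r"<TABLE\b[^>]*>.*?</TABLE>", " ", text, flags=FLAGS)
    # All remaining markup tags.
    text = re.sub(r"<[^>]+>", " ", text)
    # Character entity references, e.g. &amp; becomes &.
    text = html.unescape(text)
    # Collapse runs of spaces and tabs but keep line breaks for later sentence parsing.
    return re.sub(r"[ \t]+", " ", text).strip()


sample = (
    "<P>Sales &amp; marketing costs rose.</P>"
    "<TABLE><TR><TD>1</TD></TR></TABLE>"
    "<p>Net loss widened.</p>"
)
print(clean_filing(sample))
```

Order matters here: tables and encoded blocks must be removed before the generic tag stripper runs, otherwise their contents would survive as loose text.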

Page 14: Natural Language Processing and Textual Analysis in Finance and Accounting

• Iterate thru forms: (continued)

– Parse form into tokens

    • Regex: (?i:\b[-A-Z]{2,}\b)

• Iterate thru each token to see if it matches an entry in a master dictionary

• Tabulate words
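The tokenize, match, and tabulate steps might look like this in Python. The token pattern is the one on the slide (case-insensitive, letters and hyphens, at least two characters); the tiny dictionaries and the function names are hypothetical stand-ins for a real master dictionary and word list.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"\b[-A-Z]{2,}\b", flags=re.IGNORECASE)


def tokenize(text):
    """Split text into upper-case tokens of two or more letters or hyphens."""
    return [token.upper() for token in TOKEN_RE.findall(text)]


def tabulate(text, master_dictionary, category_words):
    """Count dictionary words overall and the subset that falls in one category."""
    counts = Counter(t for t in tokenize(text) if t in master_dictionary)
    total = sum(counts.values())
    in_category = sum(n for word, n in counts.items() if word in category_words)
    return total, in_category, counts


# Hypothetical miniature dictionaries for illustration only.
MASTER = {"THE", "LOSS", "WAS", "LARGE", "LOSSES", "MAY", "CONTINUE"}
NEGATIVE = {"LOSS", "LOSSES"}

total, negative, counts = tabulate(
    "The loss was large; losses may continue. XQZ", MASTER, NEGATIVE
)
print(total, negative)
```

Tokens that are not in the master dictionary (XQZ in the sample) are dropped before counting, so the denominator of any proportion reflects recognized words only.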

Page 15: Natural Language Processing and Textual Analysis in Finance and Accounting

When creating word lists, should we list root words (lexemes) and stem, or expand all root words to include inflections?

Page 16: Natural Language Processing and Textual Analysis in Finance and Accounting

• Stemming

– Programmatically collapse words down to root lexeme:

    • expensive, expensed, expensing => expense

• Inflection

– depreciate=>depreciated/depreciates/depreciating/depreciation

– Avoids morphologies like: blind / blinds; odd / odds; bitter / bitters
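The contrast between the two approaches can be sketched in Python. The suffix stripper below is a deliberately crude stand-in for a real stemmer (it is not the Porter algorithm), and the inflection table is a hypothetical two-entry example.

```python
def naive_stem(word):
    """Crude suffix stripper, for illustration only."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def expand(inflections):
    """Build an explicit word list from roots and their chosen inflections."""
    words = set()
    for root, forms in inflections.items():
        words.add(root)
        words.update(forms)
    return words


# Hypothetical table: each root lists only the inflections we want to count.
INFLECTIONS = {
    "depreciate": ["depreciated", "depreciates", "depreciating", "depreciation"],
    "odd": [],  # "odds" is deliberately left out
}
WORD_LIST = expand(INFLECTIONS)
ROOTS = set(INFLECTIONS)

for token in ("depreciating", "odds"):
    # Columns: token, matched by explicit list, matched by stemming to a root
    print(token, token in WORD_LIST, naive_stem(token) in ROOTS)
```

With the explicit list, "depreciating" is counted and "odds" is not. The crude stemmer gets both wrong: it reduces "depreciating" to a fragment that is not a root, and it folds "odds" into "odd".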

Page 17: Natural Language Processing and Textual Analysis in Finance and Accounting

• The text processing literature shows that stemming does not in general improve performance. Essentially stemming does not work for morphologically rich languages.

Page 18: Natural Language Processing and Textual Analysis in Finance and Accounting

• Loughran/McDonald JF 2011 word lists

– Create a dictionary of all words occurring in 10-Ks from 1994-2007.

– Classify words occurring in 5% or more of the documents.

Page 19: Natural Language Processing and Textual Analysis in Finance and Accounting

• Loughran/McDonald JF 2011 word lists

– Fin-Neg – negative words (e.g., loss, bankruptcy, indebtedness, felony, misstated, discontinued, expire, unable). N = 2,349

– Fin-Pos – positive words (e.g., beneficial, excellent, innovative). N = 354

Notice that in financial reporting it is unlikely that negative words will be negated (e.g., not terrible earnings), whereas positive words are easily qualified or compromised. Although you can easily account for simple negation, typical forms of negation are difficult to detect.
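One way to account for simple negation is to check whether a negator appears shortly before a positive word. The sketch below assumes a six-word negator list and a three-word look-back window; both are illustrative choices, and the function name is hypothetical.

```python
import re

# Assumed negator list and window size, for illustration.
NEGATORS = {"NO", "NOT", "NONE", "NEITHER", "NEVER", "NOBODY"}


def count_positive(text, positive_words, window=3):
    """Return (plain, negated) counts of positive words in text."""
    tokens = [t.upper() for t in re.findall(r"\b[-A-Z]{2,}\b", text, flags=re.IGNORECASE)]
    plain = negated = 0
    for i, token in enumerate(tokens):
        if token not in positive_words:
            continue
        # Look back up to `window` tokens for a negator.
        if NEGATORS & set(tokens[max(0, i - window):i]):
            negated += 1
        else:
            plain += 1
    return plain, negated


print(count_positive(
    "Results were excellent. The new product was not very beneficial.",
    {"EXCELLENT", "BENEFICIAL"},
))
```

The window ignores sentence boundaries and cannot see qualifications that come after the positive word, which is part of why less direct forms of negation remain hard to detect.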

Page 20: Natural Language Processing and Textual Analysis in Finance and Accounting

• Loughran/McDonald JF 2011 word lists

– Fin-Unc – uncertainty words. Note here the emphasis is more so on uncertainty than risk (e.g., ambiguity, approximate, assume, risk). N = 291

– Fin-Lit – litigious words (e.g., admission, breach, defendant, plaintiff, remand, testimony). N = 871

Page 21: Natural Language Processing and Textual Analysis in Finance and Accounting

• Loughran/McDonald JF 2011 word lists

– Modal Strong – e.g., always, best, definitely, highest, lowest, will. N = 19

– Modal Weak – e.g., could, depending, may, possibly, sometimes. N = 27

Page 22: Natural Language Processing and Textual Analysis in Finance and Accounting

• Use of word lists:

“Content analysis stands or falls by its categories. Particular studies have been productive to the extent that the categories were clearly formulated and well adapted to the problem”

Berelson (1952, p. 92)

Page 23: Natural Language Processing and Textual Analysis in Finance and Accounting

• Zipf’s law – the most frequent word will appear twice as often as the second most frequent word and three times as often as the third, etc. Much like the distribution of market cap in finance.

• Always look at the words driving your counts
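Because a few words tend to dominate any count, it helps to list the words that contribute most to a category total. A minimal Python sketch, with a hypothetical function name and a hypothetical three-word category:

```python
import re
from collections import Counter


def top_category_words(text, category_words, n=10):
    """Return the n category words that contribute most to the category count."""
    tokens = (t.upper() for t in re.findall(r"\b[-A-Z]{2,}\b", text, flags=re.IGNORECASE))
    counts = Counter(t for t in tokens if t in category_words)
    return counts.most_common(n)


# Hypothetical category and text for illustration.
category = {"LOSS", "LITIGATION", "IMPAIRMENT"}
text = "Loss, loss, loss. The litigation risk and one impairment."
for word, count in top_category_words(text, category):
    print(word, count)
```

If one or two words account for most of a category's total, it is worth checking whether they carry the intended meaning in the documents being analyzed.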

Page 24: Natural Language Processing and Textual Analysis in Finance and Accounting
Page 25: Natural Language Processing and Textual Analysis in Finance and Accounting

• Resources:

– www.nd.edu/~mcdonald/Word_Lists.html

    • Sentiment dictionaries

    • Master dictionary

    • Lists of stop words

    • 1994-2011 10-X file summaries spreadsheet
