Empowering the future of legal decision-making
© LexPredict 2012-2016
@lexpredict
NLP: A Primer and Portfolio
prepared for: MSU Law Review Symposium
prepared on: Mar 2017
Michael J Bommarito II
IIT Chicago-Kent / MSU / Michigan / Stanford
A look at our presentation agenda
Presentation Section
What is NLP?
How does ML fit?
Sources
Example Software
Example Research
Questions
End
What is NLP?
A Brief Primer
Let’s start with some text.
“Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain.
The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.”
(Bloomberg article on Sandy)
Real Data
When we work with real data, we often need to pre-process and clean data before we can segment and tokenize.
Consider, for example:
• Hand-written documents: OCR
• Digital formats: PDF, Word, WordPerfect, HTML
• Typesetting remnants, e.g., page breaks, line break hyphens
Pre-processing is very important! All subsequent work depends on this quality.
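A minimal sketch of cleaning the typesetting remnants mentioned above, assuming the text has already been extracted to a plain string (for example by OCR). The function name `clean_text` and the three rules it applies are illustrative assumptions, not a complete pre-processing pipeline.

```python
import re

def clean_text(raw):
    """Remove a few common typesetting remnants from extracted text."""
    # Form feeds (\f) commonly mark page breaks in extracted text.
    text = raw.replace("\f", "\n")
    # Rejoin words hyphenated across a line break: "evacu-\nation" -> "evacuation".
    # Caveat: this also joins true compounds that happened to break at the hyphen.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs left over from page layout.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()

sample = "forced the evacu-\nation of the   New Jersey shore\f"
print(clean_text(sample))
```

Real documents usually need format-specific handling before rules like these can be applied.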
What kind of questions can we ask?
Basic
• What is the structure of the text?
  • Paragraphs
  • Sentences
  • Tokens/words
• What are the “words” that appear in this text?
  • Nouns
    • Subjects
    • Direct objects
    • …
  • Verbs
Advanced
• What are the concepts that appear in this text?
• How does this text compare to other text?
Segmentation and Tokenization
“Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain.
The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.”
Segment Types
• Paragraphs
• Sentences
• Tokens
Segmentation and Tokenization
But how does it work?
Paragraphs
• Two consecutive line breaks
• A hard line break followed by an indent
Sentences
• Period, except abbreviation, ellipsis within quotation, etc.
Tokens and Words
• Whitespace
• Punctuation
Remember what real-world text looks like – think text messages and email.
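The rules above can be sketched with regular expressions. This is a deliberately naive illustration: the function names are hypothetical, the indent rule for paragraphs is not implemented, and the sentence rule will split incorrectly after abbreviations such as "Dr."

```python
import re

def split_paragraphs(text):
    # Two or more consecutive line breaks separate paragraphs.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_sentences(paragraph):
    # Naive rule: break after . ! or ? when followed by whitespace and a capital letter.
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", paragraph)

def tokenize(sentence):
    # Keep numbers like 3,200 and hyphenated words together;
    # every other punctuation mark becomes its own token.
    return re.findall(r"\d[\d,.]*\d|\w+(?:[-']\w+)*|[^\w\s]", sentence)

text = ("Hurricane Sandy grounded 3,200 flights. New York suspended subway service.\n\n"
        "The system may inflict $18 billion in damage.")
paragraphs = split_paragraphs(text)
sentences = split_sentences(paragraphs[0])
print(len(paragraphs), len(sentences))
print(tokenize(sentences[0]))
```

Production tokenizers handle many more cases (URLs, emoticons, abbreviations), which is why real-world text such as email is hard.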
Segmentation and Tokenization
“Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain.
The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.”
Paragraphs: 2
Sentences: 2
Words: 561.
['Hurricane', 'Sandy', 'grounded', '3,200', 'flights', 'scheduled', 'for', 'today', 'and', 'tomorrow', …]
What kind of questions can we ask?
We now have an ordered list of tokens.
['Hurricane', 'Sandy', 'grounded', '3,200', 'flights', 'scheduled', 'for', 'today', 'and', 'tomorrow', …]
• Does the phrase “quote stuffing” occur in the text?
• How many times does “Sandy” occur?
• How often does “outage” occur after “power”?
• What percentage of tokens are numbers?
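Each of these questions reduces to a short pass over the ordered token list. The sketch below uses only the ten tokens shown above; the helper names are hypothetical and the numeric test is a simple assumption (digits with optional commas and one decimal point).

```python
tokens = ["Hurricane", "Sandy", "grounded", "3,200", "flights",
          "scheduled", "for", "today", "and", "tomorrow"]

def contains_phrase(tokens, phrase):
    # Slide a window the length of the phrase across the token list.
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

def count_token(tokens, word):
    return sum(1 for t in tokens if t == word)

def count_following(tokens, first, second):
    # Count adjacent pairs where `second` comes directly after `first`.
    return sum(1 for a, b in zip(tokens, tokens[1:]) if a == first and b == second)

def percent_numeric(tokens):
    if not tokens:
        return 0.0
    numeric = [t for t in tokens
               if t.replace(",", "").replace(".", "", 1).isdigit()]
    return 100.0 * len(numeric) / len(tokens)

print(contains_phrase(tokens, ["quote", "stuffing"]))
print(count_token(tokens, "Sandy"))
print(count_following(tokens, "power", "outage"))
print(percent_numeric(tokens))
```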
An Aside on Storage
Data: The word ‘the’ ten times and the word ‘a’ ten times.
Representation 1 - Ordered List: [‘the’, ‘a’, ‘the’, ‘a’, ‘the’, ‘a’, …]
Representation 2 – Term Frequency: [(‘the’, 10), (‘a’, 10)]
An Aside on Storage
Representation 1 - Ordered List: [‘the’, ‘a’, ‘the’, ‘a’, ‘the’, ‘a’, …]
Representation 2 - Frequency Map: [(‘the’, 10), (‘a’, 10)]
Tradeoffs
• Total space
• Ease of answering certain questions
• Information about context
Not all software makes the same choice!
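The tradeoff can be seen directly in a few lines. This sketch assumes the alternating order shown in Representation 1 and uses Python's `collections.Counter` as one possible frequency map.

```python
from collections import Counter

ordered = ["the", "a"] * 10      # Representation 1: 20 items, order preserved
frequency = Counter(ordered)     # Representation 2: 2 entries, order lost

# Total space: 20 stored items versus 2 stored entries.
print(len(ordered), len(frequency))

# Ease of answering: a count is a single lookup in the frequency map.
print(frequency["the"])

# Context: "what follows 'the'?" can only be answered from the ordered list.
followers = Counter(b for a, b in zip(ordered, ordered[1:]) if a == "the")
print(followers)
```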
Stopwording, Stemming, Parsing, and Tagging
Stopwording
• Removing “filler” words like prepositions, auxiliary or infinitive verbs, and conjunctions.
Stemming
• Matching declined nouns like dog/dogs or child/children.
• Matching conjugated verbs like run/ran.
Parsing
• Determining the “structure” of a sentence, typically as represented by a grade school sentence diagram (requires grammar definition; we’ll skip).
Tagging
• Identifying the part of speech of each token in a sentence.
Stopwording
Original:
Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain.
The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.
After stopwording:
Hurricane Sandy grounded 3,200 flights scheduled today tomorrow, prompted New York suspend subway bus service forced evacuation New Jersey shore headed toward land life-threatening wind rain.
System, killed many 65 people Caribbean path north, may capable inflicting much $18 billion damage barrels New Jersey tomorrow knock power millions week, according forecasters risk experts.
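Stopwording is a filter against a fixed word list. The sketch below uses a tiny illustrative stopword set chosen for this example; it is an assumption, not the list used to produce the output above, and real lists contain a few hundred words.

```python
# Illustrative subset only; not the stopword list behind the slide's output.
STOPWORDS = {"for", "and", "to", "the", "of", "as", "it", "with"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    # Compare case-insensitively so "The" is removed along with "the".
    return [t for t in tokens if t.lower() not in stopwords]

sentence = "forced the evacuation of the New Jersey shore as it headed toward land"
print(" ".join(remove_stopwords(sentence.split())))
```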
Stopwording + Stemming
Original:
Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain.
The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.
After stopwording and stemming:
Hurrican Sandi ground 3,200 flight schedul today tomorrow, prompt New York suspend subway bu servic forc evacu New Jersey shore head toward land life-threaten wind rain.
System, kill mani 65 peopl Caribbean path north, may capabl inflict much $18 billion damag barrel New Jersey tomorrow knock power million week, accord forecast risk expert.
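To show the idea behind stemming, here is a toy suffix stripper. It is not the stemmer that produced the output above, which looks like the result of a fuller rule set; the suffix list and the minimum stem length are assumptions made for illustration.

```python
# Toy rules, checked in order; real stemmers use many more rules.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    for suffix in SUFFIXES:
        # Require at least three remaining letters so "bus" is not cut to "bu".
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["grounded", "flights", "headed", "inflicting", "bus", "children"]
print([toy_stem(w) for w in words])
```

Suffix stripping cannot relate irregular forms such as child/children or run/ran; matching those generally requires dictionary-based lemmatization.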
What is ML?
And how does it fit with NLP?
Definition: Automated classification and prediction on data.
Examples:
• Product recommenders, a la Amazon
• Computer vision – is it a cat?
• Sentiment analysis
• Topic classification
• Document clustering
At least two stages to a classification problem:
• Training
• Classification
What is Machine Learning?
Learning
Machine learning requires “learning” or “training.”
There are two types of training:
• Supervised
• Unsupervised
The goal of training is to determine a mapping from input features to a set of target classes.
Learning
Imagine a student given a small list of organisms and descriptions. The student is tasked to assign the organisms into groups based on these descriptions. Where do the groups come from?
Supervised: The teacher provides the answers while learning.
Unsupervised: The teacher provides nothing while learning.
In our example, the teacher will typically provide the “canonical” domains and kingdoms of biology. However, most real-world problem domains are not so well-studied.
Learning
What if the teacher gave the student some of the answers?
This is semi-supervised learning.
Supervised: The teacher provides the answers while learning.
Semi-supervised: The teacher provides some answers while learning.
Unsupervised: The teacher provides nothing while learning.
Classification
The student has now learned to map from an organism’s description to a group. Now, the student is sent out into the field to use their knowledge to classify newly discovered organisms. They observe the organisms and document the features they learned to use. Then, they apply the learned rules to determine the class of organism.
Replace the student with an algorithm and we have machine learning.
Sentiment Analysis Example
Organisms: Restaurant reviews
Descriptions:
• Number of positive phrases
• Number of negative phrases
• Number of times visited
• Number of restaurants reviewed
• Recency of review
Target: 1-5 stars for restaurant sentiment
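The two stages, training and classification, can be sketched with a nearest-centroid classifier, one of the simplest supervised methods (it is a stand-in chosen for brevity, not one of the algorithms named in this deck). The training data is invented for illustration and uses only two of the features above: the number of positive and negative phrases.

```python
from collections import defaultdict
from math import dist

def train(examples):
    """Training: average the feature vectors of each class into a centroid."""
    grouped = defaultdict(list)
    for features, label in examples:
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in grouped.items()
    }

def classify(model, features):
    """Classification: pick the class whose centroid is closest."""
    return min(model, key=lambda label: dist(model[label], features))

# Hypothetical reviews: (positive phrases, negative phrases) -> stars
training = [((5, 0), 5), ((4, 1), 5), ((2, 2), 3),
            ((3, 3), 3), ((0, 4), 1), ((1, 5), 1)]
model = train(training)
print(classify(model, (4, 0)))
print(classify(model, (0, 6)))
```

The teacher's answers are the star labels in `training`; the student's field work is the call to `classify` on reviews it has not seen.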
Some Machine Learning Algorithms
Supervised
• Statistical models
  • Bayesian, e.g., Naïve Bayes Classification
  • Frequentist, e.g., Ordinary Least Squares
• Neural Networks (NN)
• Support Vector Machines (SVM)
• Random Forests (RF)
• Genetic Algorithms (GA)
Semi/unsupervised
• Neural Networks (NN)
• Clustering
  • K-means
  • Hierarchical
  • Radial Basis (RBF)
  • Graph
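As an unsupervised example, here is a minimal K-means sketch: no labels are given, and the groups emerge from the data. The points are invented, and initializing from the first k points is a simplifying assumption; practical implementations use random or smarter initialization and a convergence check.

```python
from math import dist

def k_means(points, k, iterations=10):
    centroids = list(points[:k])  # naive initialization: the first k points
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters

points = [(1, 1), (9, 9), (1, 2), (8, 9), (2, 1), (9, 8)]
centroids, clusters = k_means(points, k=2)
print(clusters)
```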
Notes on Algorithm Diversity
Not all algorithms return scores/probabilities; some are binary.
• True, True, False
• 0.9, 0.7, 0.1
Not all algorithms support more than two classes.
• Cat, Dog, Mouse
• Cat, Not Cat
Not all algorithms scale similarly.
• 1M documents = 1 day
• 10M documents = {10 days, 100 days, 1000 days}
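One way to read the two numeric examples, sketched in a few lines. The 0.5 threshold is an assumption, as is the interpretation that the three runtimes correspond to linear, quadratic, and cubic growth in the number of documents.

```python
# Scores can always be reduced to binary answers by choosing a threshold,
# but binary answers cannot be turned back into scores.
scores = [0.9, 0.7, 0.1]
binary = [s >= 0.5 for s in scores]  # 0.5 is an assumed cutoff
print(binary)

def projected_days(base_days, base_docs, new_docs, exponent):
    """Projected runtime if cost grows as (number of documents) ** exponent."""
    return base_days * (new_docs / base_docs) ** exponent

for name, exponent in [("linear", 1), ("quadratic", 2), ("cubic", 3)]:
    print(name, projected_days(1, 1_000_000, 10_000_000, exponent))
```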
eDiscovery – a brief aside
[Diagram: Inputs, Parameters, Outputs]
Secret: Most black boxes are very similar inside.
You just saw all of the building blocks.
eDiscovery Terminology Translation:
• Predictive Coding = Classification Problem
• “Relevant”, “Privileged” = Class or Label
• Review = Training a model
• Production = Running a model
Sources
Where do we see it in the wild?
• Statutory Material
  • Statutes at Large
  • US Code
  • Michigan Compiled Laws
• Regulatory Material
  • Federal Register
  • Code of Federal Regulations
  • SEC Filings
  • FCC Orders
• Judicial Material
  • Briefs
  • Opinions
  • (Evidence: eDiscovery)
• Other Examples
  • Executive dialog (State of the Union, Twitter)
  • Federal Reserve governors
Sources of Natural Language
Enough of government data.
Is there any useful data inside of organizations like businesses?
Software Example
A thinly-veiled product pitch: ContraxSuite for M&A
The Hope (the holy document grail)
[Figure: Acme, Inc. File Server; labels: Financials, Legal, Operations, Marketing & Sales, Supply Chain]
Real-World Example
The Reality (the document swamp)
Acme, Inc.
Document Funnel
improving diligence in the real world
Five stages (STAGE 1 through STAGE 5):
• Search: Search complete file stores, document management systems, and mail servers
• Organize: Utilize both guided and automatic document organization
• Identify: Identify important factors in policies, procedures, plans, and legal documents
• Train: Train new assistants on any document, no development required.
• Visualize: Generate visualizations for both one-time and ongoing use
ContraxSuite: Unlocking the value in your documents
Search
Organize
Identify
Visualize
Train
Search
Organize
Identify
Steps
i. Identify policies, procedures, and plans
ii. Identify material pre-sales and sales discussions
iii. Identify traditional legal agreements
Points
i. Find all important written communication, not just some
ii. Sales teams and management teams frequently execute agreements without awareness of legal implications
Search
Organize
Identify
Documents can be organized using guided and automatic methods
Search
Organize
Identify
For policies, plans, and legal documents:
i. Identify common clauses
ii. Identify common regulatory and statutory entities
iii. Identify common geopolitical and business entities
iv. Customize on your own.
Visualize
Train
For any type of document:
• Train new clause-tagging models
• Train new clause classifiers
Visualize
Train
• Reports for one-time/ad-hoc analysis
• Dashboards for ongoing usage
Research Examples
(all of the examples are mine)
NLP: A Primer and Portfolio
https://www.lexpredict.com
@lexpredict
Thank you!