Jan. 25 - Ottawa Machine Learning Meetup


CLASSIFYING OMNIBUS BILLS

OTTAWA MACHINE LEARNING MEETUP - JAN. 25TH, 2016

SAMUEL WITHERSPOON, MATHEW SONKE

DISCLAIMER

THIS IS OUR FIRST ITERATION AND IS A WORK IN PROGRESS.

PURPOSE

WE WANT TO SHOW HOW WE MOVED FROM START TO A FIRST SET OF RESULTS ON AN ML PROBLEM

SUMMARY OF EFFORT

≈ 50 HOURS SPENT

≈ 120 BILLS MANUALLY CLASSIFIED

SOURCE CODE: https://github.com/switherspoon/MachineLearningMeetup

WHAT IS AN OMNIBUS BILL?

TYPICALLY VERY LONG

TYPICALLY MODIFIES LOTS OF OTHER BILLS

For example, Bill C-51

A BILL THAT HAS A WIDE VARIETY OF TOPICS

THAT DEFINITION INFORMED OUR FEATURES:

LENGTH OF BILL

DIVERSITY OF TOPICS IN THE BILL

NUMBER OF OTHER BILLS MODIFIED/REFERENCED

WHAT DOES AN OMNIBUS BILL LOOK LIKE?

BILL C-51 - 41st PARLIAMENT, 2nd SESSION: 4/19 PAGES

BILL C-54 - 41st PARLIAMENT, 2nd SESSION: 1/1 PAGE

GETTING STARTED

WE USED PYTHON3 WITH:

1. NLTK (http://www.nltk.org/) - FOR NLP

2. SCIKIT-LEARN (http://scikit-learn.org/stable/) - FOR CLASSIFIER

3. GENSIM (https://radimrehurek.com/gensim/) - FOR TOPIC MODEL

4. PSYCOPG2 (http://initd.org/psycopg/) - FOR DATA EXTRACT

ALL INSTALLED WITH PIP3

GETTING STARTED (CONT…)

WE SOURCED OUR DATA FROM:

https://openparliament.ca/

http://parl.gc.ca

DATA ANALYSIS

MANUALLY SKIMMED AND EXTRACTED FEATURES FROM ≈120 BILLS AND BUILT A SPREADSHEET

link: https://docs.google.com/spreadsheets/d/1kpbX78NZQ9bJHGVPoSmLE4LcE4Hht1UXxXg90gV1CVU/edit?usp=sharing

MODEL FEATURES

LENGTH OF BILL

NUMBER OF BILLS REFERENCED

AVERAGE SEMANTIC DISTANCE OF TOPICS IN EACH BILL

THE MODEL

THE CLASSIFIER

NAIVE BAYES

EASY

FAST

UNDERSTANDABLE

WORKS WELL WITH A SMALL TRAINING SET (MAYBE NOT ONE THIS SMALL)

LENGTH OF BILL

LENGTH OF RAW STRING READ IN FROM FILES

AS EASY AS: len(raw)
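A minimal sketch of this feature, assuming the bill text has already been saved to disk (the file name is hypothetical):

```python
# Length-of-bill feature: just the character count of the raw text.
# 'bill_c51.txt' is a hypothetical file name for illustration.
with open('bill_c51.txt', encoding='utf-8') as f:
    raw = f.read()

bill_length = len(raw)  # the feature is literally len(raw)
```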

NUMBER OF BILLS REFERENCED

(1) DATA RETRIEVAL

(2) PREPROCESSING

(3) NAMED ENTITY RECOGNITION (NER)

(1) DATA RETRIEVAL

2 DATA SETS TO COLLECT

• CONSOLIDATED LIST OF ACTS

• FULL TEXT OF BILLS

DATA RETRIEVAL CONT…

LIST OF ACTS PROVIDED BY GOVERNMENT OF CANADA (http://laws-lois.justice.gc.ca/eng/acts/)

WE NEEDED A WEB SCRAPER AS NO API IS AVAILABLE

• SCRAPY IS POWERFUL BUT HAS NO PYTHON3 SUPPORT

• IMPORT.IO WORKED WELL FOR OUR NEEDS

DATA RETRIEVAL CONT…

TEXT OF BILLS RETRIEVED FROM OPENPARLIAMENT DATABASE USING SQL
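A hypothetical sketch of that extract with psycopg2; the connection parameters and the table/column names are assumptions, since the deck doesn't show the openparliament schema:

```python
import psycopg2

# Connection parameters and the table/column names below are hypothetical.
conn = psycopg2.connect(dbname='openparliament', user='postgres')
cur = conn.cursor()
cur.execute("SELECT number, session, text FROM bill_texts;")
rows = cur.fetchall()  # one (bill_number, session, full_text) tuple per row
conn.close()
```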

(2) PREPROCESSING

OPENPARLIAMENT DATABASE ISN’T PERFECT

• REMOVED DUPLICATES

• VERIFIED SESSION NUMBER WAS CORRECT

• CONVERTED EVERYTHING TO LOWERCASE
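A minimal sketch of those cleanup steps, reusing the hypothetical rows from the extract above (the session-number check is omitted, since the deck doesn't describe how it was done):

```python
def clean_bills(rows):
    """rows: (bill_number, session, text) tuples from the extract."""
    seen = set()
    cleaned = []
    for number, session, text in rows:
        key = (number, session)
        if key in seen:          # remove duplicates
            continue
        seen.add(key)
        cleaned.append((number, session, text.lower()))  # lowercase everything
    return cleaned
```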

(3) NAMED ENTITY RECOGNITION

MANY APPROACHES TO THIS

• HAND-CRAFTED GRAMMAR BASED

• STATISTICAL MODELS

• MATCHING AGAINST A LIBRARY

NAMED ENTITY RECOGNITION CONT…

WE NOTICED COMMON PHRASES LIKE “AMENDS”, “RELATED AMENDMENTS”, “REPLACED BY” WHEN REFERENCING ACTS

ULTIMATELY WE MATCHED BILL TEXT AGAINST A LIBRARY

• THIS GAVE US GOOD RESULTS WITH LITTLE CODE

• WON’T ALWAYS WORK
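A minimal sketch of that matching, assuming the scraped consolidated list of act titles and a cleaned bill text as inputs (the function and variable names are ours):

```python
def count_referenced_acts(bill_text, act_titles):
    """Count distinct acts from the consolidated list mentioned in a bill."""
    text = bill_text.lower()
    # Substring match of each known act title against the bill text.
    referenced = {title for title in act_titles if title.lower() in text}
    return len(referenced)  # the 'number of bills referenced' feature
```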

SEMANTIC DISTANCE OF TOPICS

HYPOTHESIS:

SINCE AN OMNIBUS BILL HAS MANY DIFFERENT TOPICS, THE AVERAGE DISTANCE BETWEEN TOPICS IN AN OMNIBUS BILL WILL BE GREATER THAN IN A NON-OMNIBUS BILL.

SEMANTIC DISTANCE OF TOPICS PROCEDURE

(1) PREPROCESS A BILL

(2) LDA TOPIC MODELLING ON THE BILL

(3) SEMANTIC SIMILARITY (DISTANCE MEASURE)

(4) AVERAGE TOPIC DISTANCE OF THE BILL

(1) PREPROCESSING

• READ IN FILES

• TOKENIZE WORDS

• REMOVE STOP WORDS

• IGNORE WORD ORDER (BAG OF WORDS)
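A minimal sketch of those steps with NLTK (assumes the punkt and stopwords data have been downloaded):

```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time setup: nltk.download('punkt'); nltk.download('stopwords')

def preprocess(raw):
    tokens = word_tokenize(raw.lower())   # tokenize words
    stop = set(stopwords.words('english'))
    # Keep alphabetic tokens that aren't stop words; word order is
    # ignored downstream (bag of words), so a flat list is enough.
    return [t for t in tokens if t.isalpha() and t not in stop]
```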

(2) LDA TOPIC MODELING

• PROBABILISTIC TOPIC MODEL

• WE ARE NOT USING IT IN ITS OPTIMAL APPLICATION

• PRESUMES DOCUMENTS CONTAIN A HIDDEN PROBABILISTIC STRUCTURE BUILT AROUND TOPICS

• IGNORES WORD ORDER
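A minimal gensim sketch, under one assumption the deck doesn't spell out: the bill is split into chunks of preprocessed tokens so LDA has several "documents" to work with. The chunk contents and num_topics=5 are illustrative, not tuned:

```python
from gensim import corpora, models

# token_chunks: token lists from preprocess(), one per chunk of the bill.
token_chunks = [
    ['firearm', 'licence', 'registry'],
    ['tax', 'income', 'deduction'],
    ['fishery', 'habitat', 'licence'],
]

dictionary = corpora.Dictionary(token_chunks)
bow = [dictionary.doc2bow(chunk) for chunk in token_chunks]

lda = models.LdaModel(bow, id2word=dictionary, num_topics=5, passes=10)
for line in lda.print_topics():
    print(line)  # top words per topic; the weights are ignored (next slide)
```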

LDA CONT…

• MANY BILLS TOO SHORT FOR MEANINGFUL ANALYSIS WITH LDA

• BILLS THAT ARE TOO SHORT GET AN AGGREGATE SIMILARITY SCORE OF ‘1’

• THIS IS A REALLY BAD WORKAROUND

• WE IGNORE THE LDA TOPIC WEIGHTS/PROBABILITIES

• THIS IS AN OPTIMIZATION PROBLEM

MORE READING: https://www.cs.princeton.edu/~blei/papers/Blei2012.pdf

(3) SEMANTIC SIMILARITY

LIN SIMILARITY

BUT WHAT DOES THIS MEAN???

WORDNET

A HIERARCHICAL TREE OF WORDS WITH MORE GENERAL WORDS AT THE ROOT AND MORE SPECIFIC WORDS AT THE LEAVES

SIMILARITY CONT…

LIN SIMILARITY

*OVERSIMPLIFICATION* THERE IS A GRAPH/NETWORK OF SYNONYMS - LIN SIMILARITY IS BASED ON THE FIRST COMMON ANCESTOR (LOWEST COMMON ANCESTOR) OF THE TWO WORDS
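For reference, the standard definition is based on information content (IC) rather than a path distance; this formula comes from the Lin similarity literature, not the slides:

$$\mathrm{sim}_{\mathrm{Lin}}(c_1, c_2) = \frac{2 \cdot \mathrm{IC}\big(\mathrm{lcs}(c_1, c_2)\big)}{\mathrm{IC}(c_1) + \mathrm{IC}(c_2)}, \qquad \mathrm{IC}(c) = -\log p(c)$$

where lcs is the lowest common ancestor in WordNet and p(c) is the probability of concept c estimated from a corpus (the Brown and SemCor information-content files used on the next slide).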

SIMILARITY CONT…

SCORES ARE BETWEEN 0 AND 1

>0.8 MEANS VERY SIMILAR

<0.2 MEANS NOT VERY SIMILAR

e.g. CAT & DOG = 0.88 OR 0.89 (BROWN AND SEMCOR IC)

HOUND & DOG = 0.88 OR 0.87 (BROWN AND SEMCOR IC)

CHAIR & DOG = 0.16 OR 0.18 (BROWN AND SEMCOR IC)
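Scores like these can be reproduced with NLTK's WordNet interface (assumes the wordnet and wordnet_ic corpora have been downloaded):

```python
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

# One-time setup: nltk.download('wordnet'); nltk.download('wordnet_ic')
brown_ic = wordnet_ic.ic('ic-brown.dat')
semcor_ic = wordnet_ic.ic('ic-semcor.dat')

cat, dog = wn.synset('cat.n.01'), wn.synset('dog.n.01')
print(cat.lin_similarity(dog, brown_ic))   # ≈ 0.88
print(cat.lin_similarity(dog, semcor_ic))  # ≈ 0.89
```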

(4) AVG. TOPIC DISTANCE IN A BILL

WE CREATED AN AVERAGE SIMILARITY SCORE FOR EACH BILL:

SUM OF ALL COMPARED SCORES / TOTAL NUMBER OF COMPARISONS
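A sketch of that average, assuming one representative noun per LDA topic and first-sense WordNet lookups (both simplifications are ours; see the flaws noted below):

```python
from itertools import combinations
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')

def average_topic_similarity(topic_words):
    """Sum of all compared scores / total number of comparisons."""
    scores = []
    for w1, w2 in combinations(topic_words, 2):
        s1 = wn.synsets(w1, pos=wn.NOUN)   # noun senses only
        s2 = wn.synsets(w2, pos=wn.NOUN)
        if s1 and s2:
            scores.append(s1[0].lin_similarity(s2[0], brown_ic))
    return sum(scores) / len(scores) if scores else 1.0  # short-bill fallback

print(average_topic_similarity(['firearm', 'tax', 'fishery']))
```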

THERE ARE FLAWS IN THIS APPROACH

• NOUN ONLY

• NO WEIGHTING

CLASSIFICATION!

WE WERE RUNNING OUT OF TIME…

WE WANTED TO COMPARE:

• NAIVE BAYES

• RANDOM FOREST DECISION TREE

• SVM

WE COMPARED:

• NAIVE BAYES!

CLASSIFIER COMPARISON

:(

NAIVE BAYES

• GAUSSIAN

• MULTINOMIAL
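A minimal scikit-learn sketch of both variants on toy stand-ins for the three features (the numbers are invented for illustration, not from the spreadsheet):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# Rows: [bill_length, bills_referenced, avg_similarity_x100]; 1 = omnibus.
X = np.array([[120000, 40, 30], [900, 1, 88], [150000, 55, 25], [1200, 2, 90]])
y = np.array([1, 0, 1, 0])

gaussian = GaussianNB().fit(X, y)
multinomial = MultinomialNB().fit(X, y)  # needs non-negative feature values

print(gaussian.predict([[100000, 35, 33]]))
print(multinomial.predict([[100000, 35, 33]]))
```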

MODEL EVALUATION

WE WON’T SHOW YOU ACCURACY BECAUSE…

CLASS IMBALANCE!

• 9 OMNIBUS BILLS IN 120 BILLS

• 7.5% CHANCE A BILL IS AN OMNIBUS BILL

• A CLASSIFIER COULD HAVE 92.5% ACCURACY BY PICKING ‘NOT OMNIBUS’ EVERY TIME!

PRECISION = True Positives / (True Positives + False Positives)

RECALL = True Positives / (True Positives + False Negatives)
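Both are one call in scikit-learn (the labels below are illustrative, not the deck's):

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 0, 1, 0, 0, 0, 1]  # 1 = omnibus
y_pred = [1, 0, 1, 0, 0, 0, 0, 1]

print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 2/3
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 2/3
```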

BUT WE HAVE A CLASS IMBALANCE PROBLEM

PRETENDING WE DON’T HAVE A PROBLEM

CLASS IMBALANCE SOLUTION

REMOVE THE IMBALANCE!!!!

WE WENT FROM 65 TRAINING EXAMPLES TO 25 TO 11 BY REMOVING NEGATIVE EXAMPLES
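A hypothetical sketch of that undersampling (the deck only gives the resulting set sizes, not the code):

```python
import random

def undersample(X, y, keep_negatives, seed=0):
    """Keep every omnibus example; keep only keep_negatives of the rest."""
    rng = random.Random(seed)
    pos = [(a, b) for a, b in zip(X, y) if b == 1]
    neg = [(a, b) for a, b in zip(X, y) if b == 0]
    rng.shuffle(neg)
    kept = pos + neg[:keep_negatives]  # drop negatives at random
    rng.shuffle(kept)
    return [a for a, _ in kept], [b for b, _ in kept]
```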

RESULTS

TRUE CLASS IMBALANCE (5:60)

NEW (5:20)

RATIOS ARE (#OMNIBUS : #NOT OMNIBUS)

REMOVING EVEN MORE

NEW (5:20)

NEWEST (5:6)

FINAL TRAINING SET


CONCLUSIONS

WE EITHER NEED: (1) SUBSTANTIALLY MORE DATA, OR (2) BETTER ACCURACY ON TOPIC EXTRACTION AND NAMED ENTITY RECOGNITION

LOTS OF ROOM FOR IMPROVEMENT

WE STILL THINK THREE FEATURES IS ENOUGH

NEED TO DO MORE WORK CLEANING/VALIDATING OUR INPUT DATA

CONCLUSIONS CONT…

WE ARE PERFORMING BETTER THAN RANDOM GUESSING!

WE WOULD LOVE HELP IMPROVING OUR APPROACH

WAYS TO IMPROVE

USE A MORE COMPLEX NER IMPLEMENTATION TO IMPROVE ACCURACY

LINKED TOPIC MODELLING

IMPROVE WORD SIMILARITY APPROACH TO INCLUDE WEIGHTINGS

EXPERIMENT WITH DOCUMENT VECTORS AND NEURAL NETS

USE DIFFERENT DISTRIBUTIONS FOR DIFFERENT FEATURES (OPTIMIZATION OF CLASSIFIER)

TRY TF-IDF AS A DIFFERENT METHOD FOR MEASURING THE ‘SEMANTIC DIFFERENCE’ IN A DOCUMENT

EXPERIMENT WITH OTHER CLASSIFIERS

EXPERIMENT WITH MORE FEATURES

QUESTIONS?

Machine learning is no cakewalk.

Can we form a group to help Ottawa companies achieve greater success with ML?

What would this group do? Who would be in it?

How would it be funded? Do we have the local talent? What about protecting IP?

Who would make the decisions? Why bother?

We want your feedback! If you'd like to participate in ongoing discussions, please leave us your contact info.

RECEIVER OPERATING CHARACTERISTIC (ROC)

[ROC plot: TRUE POSITIVE RATE vs. FALSE POSITIVE RATE, both axes from 0 to 1, with curves for Random Guess, Gaussian, and Multinomial]
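A sketch of how such a plot is produced with scikit-learn and matplotlib; the labels and scores below are invented, whereas in practice the scores would come from predict_proba on the held-out bills:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

y_true = [1, 0, 0, 1, 0, 1, 0, 0]                  # 1 = omnibus
scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.4, 0.3, 0.2]  # e.g. P(omnibus)

fpr, tpr, _ = roc_curve(y_true, scores)
plt.plot(fpr, tpr, label='classifier')
plt.plot([0, 1], [0, 1], '--', label='random guess')  # chance diagonal
plt.xlabel('FALSE POSITIVE RATE')
plt.ylabel('TRUE POSITIVE RATE')
plt.legend()
plt.show()
```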
