Machine Learning @ NYT
Dae Il Kim - [email protected]
Overview
● Assisting Great Journalism: The Story of Faulty Takata Airbags
○ Using Logistic Regression to help uncover suspicious comments
● Extracting insights from big data - A Bayesian perspective
○ BNPy: A fully pythonic framework for Bayesian Nonparametric Models
○ Refinery: A Locally Deployable Web App for Scalable Topic Modeling
● Using ML to help with news-related, non-journalistic problems
○ Single Copy - Using ML to effectively predict the number of papers to print
○ Subscribers - Retention and Audience Acquisition
○ Recommendations - Using collaborative topic models for recommendations
Complaint data from the NHTSA
The Data
The data contains 33,204 comments, 2,219 of which were painstakingly labeled as suspicious (by Hiroko Tabuchi).
A Machine Learning Approach
Develop an algorithm that predicts whether a comment is suspicious or not. The algorithm learns from the dataset which features are representative of a suspicious comment.
The Machine Learning Approach
A sample comment. We will preprocess this data for the algorithm:
- NEW TOYOTA CAMRY LE PURCHASED JANUARY 2004 - ON FEBRUARY 25TH KEY WOULD NOT TURN (TOOK 10 - 15 MINUTES TO START IT) - LATER WHILE PARKING, THE CAR THE STEERING LOCKED TURNING THE CAR TO THE RIGHT - THE CAR ACCELERATED AND SURGED DESPITE DEPRESSING THE BRAKE (SAME AS ODI PEO4021) - THOUGH THE CAR BROKE A METAL FLAG POLE, DAMAGED A RETAINING WALL, AND FELL SEVEN FEET INTO A MAJOR STREET, THE AIR BAGS DID NOT DEPLOY - CAR IS SEVERELY DAMAGED: WHEELS, TIRES, FRONT END, GAS TANK, FRONT AXLE - DRIVER HAS A SWOLLEN AND SORE KNEE ALONG WITH SIGNIFICANT SOFT TISSUE INJURIES INCLUDING BACK PAIN *SC *JB
TOKENIZE
(NEW), (TOYOTA), (CAMRY), (LE), (PURCHASED), (JANUARY), (2004), ... → Break the comment into individual words
(NEW TOYOTA), (TOYOTA CAMRY), (CAMRY LE), (LE PURCHASED), ... → Break it into bigrams (every two-word combination)
FILTER
(NEW), (TOYOTA), (CAMRY), (LE), (PURCHASED), (JANUARY), (2004), ... → Remove tokens that appear in fewer than 5 comments
(NEW TOYOTA), (TOYOTA CAMRY), (CAMRY LE), (LE PURCHASED), ... → Remove bigrams that appear in fewer than 5 comments
DATA IS READY FOR TRAINING!
The data now consists of 33,204 examples with 56,191 features
Cross-Validation
[Figure: a matrix of comments (rows, indexed by Comment ID) by features (i.e., word frequencies), with a label column (S = Suspicious, NS = Not Suspicious). A subset of the data forms the training set; the remainder is the test set, used after training to obtain accuracy measures.]
How did we do?
Experiment Setup
We hold out 25% of both the suspicious and not-suspicious comments for testing and train on the rest. We do this 5 times, creating random splits and retraining the model on each split.
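The holdout scheme above can be sketched with scikit-learn's `StratifiedShuffleSplit` (an assumed tool; the class counts here are toy numbers, not the real 2,219 / 33,204):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy labels: 1 = suspicious, 0 = not suspicious.
y = np.array([1] * 20 + [0] * 80)
X = np.random.rand(100, 5)

# Five random splits, each holding out 25% of BOTH classes for testing.
splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
for train_idx, test_idx in splitter.split(X, y):
    # Stratification preserves the class ratio in both halves,
    # so each test set gets 25% of the 20 suspicious examples.
    assert y[test_idx].sum() == 5
```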
Performance!
We obtain a very high AUC (~0.97) on our test sets.
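A minimal sketch of training the classifier and scoring it by AUC, assuming scikit-learn's `LogisticRegression` and `roc_auc_score` (the features here are synthetic one-dimensional stand-ins, not the real comment features):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Synthetic data: suspicious comments (y = 1) score higher on one
# feature than normal ones, so the classes are partly separable.
X = np.vstack([rng.normal(2.0, 1.0, (50, 1)), rng.normal(0.0, 1.0, (200, 1))])
y = np.array([1] * 50 + [0] * 200)

clf = LogisticRegression().fit(X, y)
# AUC: the probability that a randomly chosen suspicious comment is
# ranked above a randomly chosen normal one.
auc = roc_auc_score(y, clf.predict_proba(X)[:, 1])
```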
Check what we missed
These comments are potentially worth checking twice.
The most predictive words / features
Predictive of a suspicious comment
Predictive of a normal comment.
After training the model, we then applied this on the full dataset.
We looked for comments that Hiroko didn't label as suspicious but the algorithm did, and followed up on them (374 out of 33K total).
Result: 7 new cases where a passenger was injured were discovered among the comments she missed.
Understanding Documents using Topic Models
There are reasons to believe that the genetics of an organism are likely to shift due to the extreme changes in our climate. To protect them, our politicians must pass environmental legislation that can protect our future species from becoming extinct…
Decompose documents as a probability distribution over "topic" indices.
[Figure: a bar chart (scale 0 to 1) of one document's topic proportions for the topics "Politics", "Climate Change", and "Genetics".]
Topics in turn represent probability distributions over the unique words in your vocabulary.
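The two distributions above can be sketched numerically; this toy generative step samples one word by first picking a topic, then a word from that topic (topic names match the slide, but the probabilities and vocabulary are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["dna", "gene", "climate", "pollution", "obama", "state"]

# Each topic is a probability distribution over the vocabulary.
topics = np.array([
    [0.45, 0.45, 0.02, 0.02, 0.03, 0.03],   # "Genetics"
    [0.02, 0.02, 0.46, 0.46, 0.02, 0.02],   # "Climate Change"
    [0.03, 0.03, 0.02, 0.02, 0.45, 0.45],   # "Politics"
])

# A document is a probability distribution over topic indices.
doc_topic = np.array([0.2, 0.5, 0.3])

# Generate one word: pick a topic index z, then a word w from topic z.
z = rng.choice(3, p=doc_topic)
w = rng.choice(vocab, p=topics[z])
```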
Topic Models: A Graphical Model Perspective
LDA: Latent Dirichlet Allocation (Bayesian Topic Model)
Blei et al., 2001
[Figure: a document's topic proportions over "Politics", "Climate Change", and "Genetics", alongside its word counts: dna: 2, obama: 1, state: 1, gene: 2, climate: 3, government: 1, drug: 2, pollution: 3]
Bayes' Theorem
P(model | data) = P(data | model) × P(model) / P(data)
● Prior, P(model): our belief about the world before seeing data. In terms of LDA, our modeling assumptions / priors.
● Likelihood, P(data | model): given our model, how likely is this data?
● Normalization constant, P(data): we need this for valid probabilities, and it makes the problem a lot harder.
● Posterior, P(model | data): the probability of our new model given the data.
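A tiny numeric instance of Bayes' rule (the two-model setup and all numbers are illustrative, not from the talk) makes the role of the normalization constant concrete:

```python
# Two candidate models, equally likely a priori.
prior = {"model_a": 0.5, "model_b": 0.5}
# P(data | model): how likely the observed data is under each model.
likelihood = {"model_a": 0.8, "model_b": 0.2}

# The normalization constant P(data) sums over EVERY model -- for a
# model like LDA this sum/integral is what becomes intractable.
evidence = sum(prior[m] * likelihood[m] for m in prior)

# Posterior: prior times likelihood, divided by the evidence.
posterior = {m: prior[m] * likelihood[m] / evidence for m in prior}
```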
Posterior Inference in LDA
GOAL: Obtain the posterior p(θ, z | w, α, β), which means that we need to calculate the intractable normalization term p(w | α, β).
For LDA, this posterior is over latent variables representing how much a document contains of each topic k (θ) and the topic assignment of each word (z).
LDA: Latent Dirichlet Allocation (Bayesian Topic Model)
Blei et al., 2001
Scalable Learning & Inference in Topic Models
LDA: Latent Dirichlet Allocation (Bayesian Topic Model)
Blei et al., 2001
Analyze a subset of your total documents before updating.
Update θ, z, and β after analyzing each mini-batch of documents.
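The mini-batch scheme above can be sketched with scikit-learn's online LDA (an assumed stand-in; the talk's own implementation is BNPy, and the counts below are random toy data):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
# Toy bag-of-words counts: 100 documents over a 20-word vocabulary.
X = rng.integers(0, 5, size=(100, 20))

# Online (stochastic) variational inference: update the topic-word
# parameters after each mini-batch instead of after a full pass.
lda = LatentDirichletAllocation(n_components=3, learning_method="online",
                                random_state=0)
for start in range(0, 100, 25):          # four mini-batches of 25 docs
    lda.partial_fit(X[start:start + 25])

# Per-document topic proportions (each row sums to 1).
doc_topics = lda.transform(X[:5])
```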
Please check out BNPy (Bayesian Nonparametric Python)
Open source and supports a large set of powerful Bayesian nonparametric models. Actively maintained and highly scalable code.
git clone https://bitbucket.org/michaelchughes/bnpy-dev/
Refinery: An open source web-app for large document analyses
Daeil Kim @ New York Times
Founder of Refinery
[email protected]
Ben Swanson @ MIT Media Lab
Co-Founder of Refinery
[email protected]
Refinery is a 2014 Knight Prototype Fund winner. Check it out at: http://docrefinery.org
Installing Refinery
3 Simple Steps to get Refinery running. Install these first!
1) Command → git clone https://github.com/daeilkim/refinery.git
2) Go to the root folder. Command → vagrant up
3) Open a browser and go to → 11.11.11.11:8080
A Typical Refinery Pipeline
Step 1: Upload documents
Step 2: Extract Topics from a Topic Model
Step 3: Find a subset of documents with topics of interest.
Step 4: Discover Interesting Phrases
A Quick Refinery Demo
Extracting NYT articles from keyword “obama” in 2013.
What themes / topics defined the Obama administration during 2013?
Future Directions: Better tools for Investigative Reporting
[Diagram: 1) Collecting & Scraping Data → 2) Filtering & Cleaning Data → 3) Extracting Insights]
Refinery focuses on extracting insights from relatively clean data.
Great tools like DocumentCloud take care of steps 1 & 2.
Enterprise stories might be completed in a fraction of the time.
Part 3: Using ML to help with non-journalistic problems
Training predictive models for each part of this funnel
We’re interested in developing a meaningful, loyal relationship with our readers. Can we discover covariates that indicate better ways to obtain and maintain that relationship with our audience?
Starbucks Single Copy
Using machine learning to predict the number of actual copies we should sell to Starbucks outlets across the nation.
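The talk doesn't specify a model for this. Since copies sold is a count, one natural sketch is a Poisson regression via scikit-learn (the features below, e.g. a recent-sales signal, are hypothetical stand-ins, not from the talk):

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(0)
# Hypothetical per-outlet features: e.g. day-of-week signal, weather
# index, recent average sales (none of these come from the talk).
X = rng.random((200, 3))
# Copies sold: a non-negative count, simulated here as Poisson.
y = rng.poisson(lam=5 + 10 * X[:, 2])

# Poisson regression keeps predictions non-negative, which suits counts.
model = PoissonRegressor().fit(X, y)
predicted = model.predict(X[:1])   # expected copies for one outlet
```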
Understanding international audiences
Part of our ability to expand the New York Times internationally will be to leverage algorithms based on topic models to help understand reading patterns and behaviors.
Making better recommendations
Given how people read the news and some of their demographic info, can we make better recommendations for articles?
Even better, if they haven’t read anything, what kind of recommendations can we make given just their metadata?
[Diagram: a reader (Age: 32, State: NY, Job: Student) reads articles, and the model recommends new ones.]
Attract first-time users with relevant content.