Michelle Casbon
January 16, 2016 – Data Day Texas, Austin
Under the Hood of Idibon’s Scalable NLP Services
What do we do?
• Idibon creates adaptive machine intelligence that can analyze text in any language
[Diagram labels: natural language text, social media, structured insights]
Agenda
• Background
• Process walk-through
• Platform description
• Why we chose Spark
• How we’re using Spark ML & MLlib
• Challenges of adopting Spark in a distributed NLP system
What are our use cases?
• Supply Chain Risk
• Intent to purchase
• Global health trends
• Interactive Voice Response
• Multilingual news
• SMS Prioritization
• Change reception
How do we do it?
Adaptive learning
• Fewer annotations
• Lower costs
• Less time spent training
• Higher accuracy
• Improves over time
[Diagram labels: unlabeled pool, human annotation & intelligent queuing, labeled training set, machine learning]
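The adaptive learning loop can be illustrated with a small sketch. The slides do not say how intelligent queuing ranks documents, so this assumes uncertainty sampling for a binary classifier: documents whose predicted probability is closest to 0.5 go to human annotators first. The names `queue_for_annotation`, `predict_proba`, and `toy_scores` are hypothetical.

```python
from typing import Callable, List


def queue_for_annotation(
    unlabeled_pool: List[str],
    predict_proba: Callable[[str], float],
    batch_size: int,
) -> List[str]:
    """Return the batch_size documents the model is least sure about.

    Assumption: "intelligent queuing" is modeled as uncertainty sampling;
    the talk does not specify the actual ranking method.
    """
    # Uncertainty is highest when the positive-class probability is near 0.5.
    ranked = sorted(unlabeled_pool, key=lambda doc: abs(predict_proba(doc) - 0.5))
    return ranked[:batch_size]


# Hypothetical stand-in for a model: fixed scores, for demonstration only.
toy_scores = {
    "great phone": 0.95,
    "it is a phone": 0.52,
    "terrible": 0.05,
    "not sure yet": 0.40,
}
batch = queue_for_annotation(list(toy_scores), toy_scores.get, batch_size=2)
```

Once annotated, the selected documents would move from the unlabeled pool into the labeled training set, and the model would be retrained before the next round of queuing.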
How do we do it?
 1. Goal Definition
 2. Identification
 3. Cleansing
 4. Training data creation
 5. Quality Control
 6. Creation
 7. Hyperparameter Tuning
 8. Intelligent Queueing
 9. Rule Creation
10. Unseen Data
[Diagram group labels: Dataset, Models, Prediction]
Scalability Challenges
• Real-time API support
• Document storage
• 1000s of individual predictions per second
• Continuous training
• Hyperparameter optimization
What does our platform look like?
Why are we using Spark?
• Wide variety of algorithms
• Active development
• Open source
• Industry-standard algorithm implementation
• Intended for use in enterprise applications
• Scalability
How are we using Spark?
• Feature Extraction
  • TF-IDF
  • Word2Vec
  • Dimensionality reduction
• Training
  • Logistic Regression
  • SVM
  • Naïve Bayes
  • LDA
• Prediction
  • Evaluation metrics
[Diagram labels: Feature Extraction, Training, Prediction; example labeled point [1.0, [1.0, 0.0, 3.0]]]
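As a concrete illustration of the first feature-extraction item, here is a minimal pure-Python TF-IDF sketch. It is not Spark code and not Idibon's implementation. It uses the smoothed inverse document frequency log((N + 1) / (df + 1)), the form given in the Spark MLlib feature-extraction documentation, but it keys features by the term itself rather than by a hashed index. The function name `tf_idf` and the sample documents are made up.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(documents: List[List[str]]) -> List[Dict[str, float]]:
    """Compute TF-IDF weights per document from already-tokenized documents.

    IDF uses the smoothed form log((N + 1) / (df + 1)), so a term that
    appears in every document gets a weight of exactly zero.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    idf = {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}
    # Term frequency is the raw count of the term within the document.
    return [
        {term: count * idf[term] for term, count in Counter(doc).items()}
        for doc in documents
    ]


# Made-up sample documents, for demonstration only.
docs = [
    ["spark", "is", "fast"],
    ["spark", "is", "open", "source"],
    ["nlp", "is", "hard"],
]
tfidf_weights = tf_idf(docs)
```

In this sample, "is" occurs in all three documents and therefore contributes nothing, while rarer terms such as "fast" receive the largest weights.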
Feature Extraction
Extract Content → Tokenize → Bigrams / Trigrams → Feature Lookup → Vector
Example vector: [1.0, 0.0, 3.0]
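The feature-extraction steps can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the production pipeline: the tokenizer is a simple regular expression, unigrams are included alongside bigrams and trigrams, and `feature_index` is a hypothetical lookup table chosen so that the result matches the example vector on the slide.

```python
import re
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Return all contiguous n-token sequences, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_features(text: str, feature_index: Dict[str, int]) -> List[float]:
    """Map text to a dense count vector using a fixed feature lookup table."""
    tokens = tokenize(text)
    # Assumption: unigrams are used together with bigrams and trigrams.
    candidates = tokens + ngrams(tokens, 2) + ngrams(tokens, 3)
    vector = [0.0] * len(feature_index)
    for feature in candidates:
        position = feature_index.get(feature)
        if position is not None:  # features missing from the lookup are dropped
            vector[position] += 1.0
    return vector


# Hypothetical lookup table; a real one would be built from training data.
feature_index = {"spark": 0, "is slow": 1, "so so fast": 2}
vector = extract_features("Spark is so so fast, so so fast, so so fast!", feature_index)
```

Keeping the lookup table fixed means every document maps to a vector of the same length, which is what lets the training and prediction stages share one model.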
Training
Vector [1.0, 0.0, 3.0] → Add classification → LabeledPoint [1.0, [1.0, 0.0, 3.0]] → LogisticRegressionWithLBFGS → LogisticRegressionModel → Model Storage
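To make the training step concrete without requiring a Spark installation, the sketch below fits a binary logistic regression in plain Python. It is a stand-in only: the slide names Spark MLlib's `LogisticRegressionWithLBFGS`, whereas this code uses batch gradient descent rather than L-BFGS, and the `LabeledPoint` class here is a local tuple that mimics the label-plus-features shape shown on the slide. The training data is made up.

```python
import math
from typing import List, NamedTuple, Tuple


class LabeledPoint(NamedTuple):
    """Plain-Python stand-in for a labeled feature vector (not the Spark class)."""
    label: float
    features: List[float]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(
    points: List[LabeledPoint], iterations: int = 500, learning_rate: float = 0.1
) -> Tuple[List[float], float]:
    """Fit weights and an intercept with batch gradient descent.

    Note: this is not L-BFGS; gradient descent keeps the sketch short.
    """
    weights = [0.0] * len(points[0].features)
    intercept = 0.0
    for _ in range(iterations):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for point in points:
            margin = intercept + sum(w * x for w, x in zip(weights, point.features))
            error = sigmoid(margin) - point.label  # gradient of the log loss
            grad_b += error
            for j, x in enumerate(point.features):
                grad_w[j] += error * x
        intercept -= learning_rate * grad_b / len(points)
        weights = [w - learning_rate * g / len(points) for w, g in zip(weights, grad_w)]
    return weights, intercept


# Made-up training set; the first point mirrors the slide's [1.0, [1.0, 0.0, 3.0]].
training_set = [
    LabeledPoint(1.0, [1.0, 0.0, 3.0]),
    LabeledPoint(1.0, [0.0, 0.0, 2.0]),
    LabeledPoint(0.0, [0.0, 2.0, 0.0]),
    LabeledPoint(0.0, [1.0, 3.0, 0.0]),
]
weights, intercept = train_logistic_regression(training_set)
```

The returned weights and intercept are the only state a model of this kind needs, which is why storing and reloading it is cheap.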
Prediction
New tweet → Extract Content → Tokenize → Bigrams / Trigrams → Feature Lookup → Vector [0.0, 1.0, 4.0]
Vector [0.0, 1.0, 4.0] → Model Lookup → Predict → Classification Lookup
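The second half of the prediction flow (model lookup, predict, classification lookup) might look roughly like this in plain Python. Everything named here is hypothetical: `MODEL_STORE`, `StoredModel`, the model id "sentiment-en", the weights, and the class names are invented for illustration and do not come from the talk.

```python
import math
from typing import Dict, List, NamedTuple


class StoredModel(NamedTuple):
    """Hypothetical stored form of a binary logistic regression model."""
    weights: List[float]
    intercept: float
    class_names: Dict[int, str]  # classification lookup: class index -> name


# Hypothetical in-memory model store; the numbers are made up.
MODEL_STORE: Dict[str, StoredModel] = {
    "sentiment-en": StoredModel(
        weights=[0.2, -1.1, 0.9],
        intercept=-0.5,
        class_names={0: "negative", 1: "positive"},
    ),
}


def predict(model_id: str, vector: List[float]) -> str:
    """Look up a model, score one feature vector, and return the class name."""
    model = MODEL_STORE[model_id]  # model lookup
    margin = model.intercept + sum(w * x for w, x in zip(model.weights, vector))
    probability = 1.0 / (1.0 + math.exp(-margin))  # predict
    class_index = 1 if probability > 0.5 else 0
    return model.class_names[class_index]  # classification lookup


label = predict("sentiment-en", [0.0, 1.0, 4.0])
```

Scoring a single vector this way involves only a dot product and a sigmoid, which gives some intuition for why per-vector prediction can be fast for small inputs.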
How do we provide online predictions with Spark?
… if you have small data
Task                   Time in µs
Vector prediction      300
DataFrame prediction   7800
DataFrames are slow ...
How do we fit Spark into our existing system?
[Diagram labels: REST API, Core functionality, Idibon custom ML, …, ML persistence layer]
How does a persistence layer enable us to use Spark?
• Real-time operationalization of many, many models
• Embed within different platforms
• Single save/load framework
• Rapidly incorporate new NLP features
• Logging/monitoring standardized & abstracted
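One way to picture a single save/load framework is a small store that writes every model, whatever library produced it, in one common format. The sketch below is an assumption-laden illustration, not Idibon's persistence layer: the `ModelStore` class, the JSON layout, and the "kind" field are all hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


class ModelStore:
    """Hypothetical persistence layer: one save/load path for every model type."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def save(self, model_id: str, kind: str, params: Dict[str, Any]) -> Path:
        """Write the model's kind and parameters as one JSON document."""
        path = self.root / f"{model_id}.json"
        path.write_text(json.dumps({"kind": kind, "params": params}))
        return path

    def load(self, model_id: str) -> Dict[str, Any]:
        """Read a model back; callers dispatch on the stored kind."""
        return json.loads((self.root / f"{model_id}.json").read_text())


# Round-trip a made-up model through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    store = ModelStore(Path(tmp))
    store.save(
        "sentiment-en",
        "logistic_regression",
        {"weights": [0.2, -1.1, 0.9], "intercept": -0.5},
    )
    restored = store.load("sentiment-en")
```

Because callers only see `save` and `load`, a model trained with Spark and one trained with custom code could sit behind the same interface, which is the property the bullets above describe.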
Summary
• Analyzing human language is hard
• We’re using exciting tools to build performant NLP systems that are faster & better than ever before
• Introduce yourself!