Michelle Casbon January 16, 2016 – Data Day Texas, Austin Under the Hood of Idibon’s Scalable NLP Services

Data Day TX 2016 - Jan 16, 2016


Page 1: Data Day TX 2016 - Jan 16, 2016

Michelle Casbon

January 16, 2016 – Data Day Texas, Austin

Under the Hood of Idibon’s Scalable NLP Services

Page 2: Data Day TX 2016 - Jan 16, 2016

What do we do?

• Idibon creates adaptive machine intelligence that can analyze text in any language

Diagram labels: natural language text, social media, structured insights

Page 3: Data Day TX 2016 - Jan 16, 2016

Agenda

• Background
• Process walk-through
• Platform description
• Why we chose Spark
• How we’re using Spark ML & MLlib
• Challenges of adopting Spark in a distributed NLP system

Page 4: Data Day TX 2016 - Jan 16, 2016

What are our use cases?

• Supply Chain Risk
• Intent to purchase
• Global health trends
• Interactive Voice Response
• Multilingual news
• SMS Prioritization
• Change reception

Page 5: Data Day TX 2016 - Jan 16, 2016

How do we do it?

Page 6: Data Day TX 2016 - Jan 16, 2016

Adaptive learning

• Fewer annotations
• Lower costs
• Less time spent training
• Higher accuracy
• Improves over time

Diagram labels: unlabeled pool, human annotation & intelligent queuing, labeled training set, machine learning
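The slide does not say how the intelligent queue picks items from the unlabeled pool. As a minimal sketch, assuming uncertainty sampling (one common active-learning strategy, not necessarily Idibon's), the queue could send annotators the items whose predicted probability is closest to 0.5. The function name, the `toy_scores` table, and the batch size are all hypothetical.

```python
from typing import Callable, List


def queue_for_annotation(
    unlabeled_pool: List[str],
    predict_proba: Callable[[str], float],
    batch_size: int,
) -> List[str]:
    """Return the items the model is least sure about (probability nearest 0.5)."""
    ranked = sorted(unlabeled_pool, key=lambda doc: abs(predict_proba(doc) - 0.5))
    return ranked[:batch_size]


# Hypothetical stand-in for a trained binary classifier's positive-class probability.
toy_scores = {
    "great phone": 0.95,
    "it arrived": 0.52,
    "terrible": 0.05,
    "fine I guess": 0.40,
}

# The two most ambiguous items go to human annotators first.
next_batch = queue_for_annotation(list(toy_scores), toy_scores.get, batch_size=2)
```

Once annotated, those items would join the labeled training set and the model would be retrained, which is how a loop of this kind can reach a given accuracy with fewer annotations.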

Page 7: Data Day TX 2016 - Jan 16, 2016

How do we do it?

1. Goal Definition
2. Identification
3. Cleansing
4. Training data creation
5. Quality Control
6. Creation
7. Hyperparameter Tuning
8. Intelligent Queueing
9. Rule Creation
10. Unseen Data Prediction

Diagram group labels: Dataset, Models

Page 8: Data Day TX 2016 - Jan 16, 2016

Scalability Challenges

• Real-time API support
• Document storage
• 1000’s of individual predictions per second
• Continuous training
• Hyperparameter optimization

Page 9: Data Day TX 2016 - Jan 16, 2016

What does our platform look like?

Page 10: Data Day TX 2016 - Jan 16, 2016

Why are we using Spark?

• Wide variety of algorithms
• Active development
• Open source
• Industry-standard algorithm implementation
• Intended for use in enterprise applications
• Scalability

Page 11: Data Day TX 2016 - Jan 16, 2016

How are we using Spark?

• Feature Extraction
  • TF-IDF
  • Word2Vec
  • Dimensionality reduction
• Training
  • Logistic Regression
  • SVM
  • Naïve Bayes
  • LDA
• Prediction
  • Evaluation metrics

Diagram labels: Feature Extraction, Training, Prediction; example labeled point [1.0, [1.0, 0.0, 3.0]]
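To make the TF-IDF item concrete, here is a small plain-Python sketch (it does not call Spark). It uses the smoothed inverse document frequency that Spark MLlib documents, log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents containing the term. The function name and the toy documents are illustrative only.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(documents: List[List[str]]) -> List[Dict[str, float]]:
    """Weight each term's raw count by a smoothed inverse document frequency."""
    doc_count = len(documents)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    idf = {
        term: math.log((doc_count + 1) / (df + 1))
        for term, df in doc_freq.items()
    }
    return [
        {term: count * idf[term] for term, count in Counter(doc).items()}
        for doc in documents
    ]


docs = [["spark", "is", "fast"], ["spark", "scales"]]
tfidf_weights = tf_idf(docs)
```

A term that appears in every document ("spark" here) gets a weight of zero, while rarer terms keep a positive weight.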

Page 12: Data Day TX 2016 - Jan 16, 2016

Feature Extraction

• Extract Content
• Tokenize
• Bigrams
• Trigrams
• Feature Lookup
• Vector: [1.0, 0.0, 3.0]
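The stages on this slide can be sketched in plain Python as follows. This is an illustration under assumptions, not Idibon's code: the regular-expression tokenizer, the choice to keep unigrams alongside bigrams and trigrams, the count-valued features, and the `feature_index` dictionary are all hypothetical. The toy input was chosen so that the result has the same form as the slide's vector.

```python
import re
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Join every run of n consecutive tokens into one feature string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def to_vector(text: str, feature_index: Dict[str, int]) -> List[float]:
    """Count known features; features missing from the index are ignored."""
    tokens = tokenize(text)
    features = tokens + ngrams(tokens, 2) + ngrams(tokens, 3)
    vector = [0.0] * len(feature_index)
    for feature in features:
        position = feature_index.get(feature)  # feature lookup
        if position is not None:
            vector[position] += 1.0
    return vector


# Hypothetical feature index mapping feature strings to vector positions.
feature_index = {"spark": 0, "is slow": 1, "fast": 2}
vector = to_vector("Spark is fast, fast, fast", feature_index)
```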

Page 13: Data Day TX 2016 - Jan 16, 2016

Training

• Vector: [1.0, 0.0, 3.0]
• Add classification
• LabeledPoint: [1.0, [1.0, 0.0, 3.0]]
• LogisticRegressionWithLBFGS
• LogisticRegressionModel
• Model Storage
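The slide names Spark's `LogisticRegressionWithLBFGS`, which is trained on labeled points and yields a `LogisticRegressionModel`. The sketch below imitates that flow in plain Python so it runs without a cluster. It is not the Spark API: the `LabeledPoint` tuple here is a stand-in, plain batch gradient descent is swapped in for L-BFGS, and the training data, step count, and learning rate are made up.

```python
import math
from typing import List, NamedTuple


class LabeledPoint(NamedTuple):
    """A classification label paired with its feature vector."""
    label: float
    features: List[float]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(points: List[LabeledPoint], steps: int = 500,
                   rate: float = 0.1) -> List[float]:
    """Fit weights (last entry is the intercept) by batch gradient descent."""
    dim = len(points[0].features)
    weights = [0.0] * (dim + 1)
    for _ in range(steps):
        gradient = [0.0] * (dim + 1)
        for point in points:
            x = point.features + [1.0]  # append a constant for the intercept
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x))) - point.label
            for i, xi in enumerate(x):
                gradient[i] += error * xi
        weights = [w - rate * g / len(points) for w, g in zip(weights, gradient)]
    return weights


def predict(weights: List[float], features: List[float]) -> float:
    score = sum(w * xi for w, xi in zip(weights, features + [1.0]))
    return 1.0 if sigmoid(score) >= 0.5 else 0.0


# "Add classification": attach a label to each feature vector.
training_set = [
    LabeledPoint(1.0, [1.0, 0.0, 3.0]),
    LabeledPoint(1.0, [0.0, 0.0, 2.0]),
    LabeledPoint(0.0, [0.0, 2.0, 0.0]),
    LabeledPoint(0.0, [1.0, 3.0, 0.0]),
]
model_weights = train_logistic(training_set)
```

The resulting weight list plays the role of the stored model: it is all that prediction needs later.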

Page 14: Data Day TX 2016 - Jan 16, 2016

Prediction

• New tweet
• Extract Content
• Tokenize
• Bigrams
• Trigrams
• Feature Lookup
• Vector: [0.0, 1.0, 4.0]
• Model Lookup
• Predict
• Classification Lookup
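The last three stages (model lookup, predict, classification lookup) might look like the following plain-Python sketch. The in-memory dictionaries, the task name "sentiment", the weights, and the label names are hypothetical; the slides do not describe how the real lookups are stored.

```python
import math
from typing import Dict, List

# Hypothetical model store: three feature weights followed by an intercept.
MODELS: Dict[str, List[float]] = {
    "sentiment": [0.2, -1.1, 0.9, -0.5],
}

# Hypothetical mapping from numeric predictions back to readable labels.
CLASSIFICATIONS: Dict[str, Dict[float, str]] = {
    "sentiment": {0.0: "negative", 1.0: "positive"},
}


def predict_label(task: str, vector: List[float]) -> str:
    weights = MODELS[task]  # model lookup
    score = sum(w * x for w, x in zip(weights, vector + [1.0]))
    probability = 1.0 / (1.0 + math.exp(-score))  # logistic prediction
    numeric = 1.0 if probability >= 0.5 else 0.0
    return CLASSIFICATIONS[task][numeric]  # classification lookup


# The vector from the slide, as produced by feature extraction on a new tweet.
label = predict_label("sentiment", [0.0, 1.0, 4.0])
```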

Page 15: Data Day TX 2016 - Jan 16, 2016

How do we provide online predictions with Spark?

DataFrames are slow … if you have small data

Task                   Time in µs
Vector prediction      300
DataFrame prediction   7800
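A quick calculation shows what these two timings imply. It assumes each figure is the latency of one prediction and that predictions run one after another on a single thread, neither of which the slide states.

```python
# Timings taken from the slide, in microseconds per prediction.
vector_us = 300
dataframe_us = 7800

# How many times slower the DataFrame path is.
slowdown = dataframe_us / vector_us

# Sequential predictions that fit into one second (1,000,000 µs).
vector_per_second = 1_000_000 // vector_us
dataframe_per_second = 1_000_000 // dataframe_us

print(f"slowdown: {slowdown:.0f}x")
print(f"vector: {vector_per_second}/s, DataFrame: {dataframe_per_second}/s")
```

Under those assumptions, only the vector path reaches thousands of predictions per second on one thread.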

Page 16: Data Day TX 2016 - Jan 16, 2016

How do we fit Spark into our existing system?

Diagram labels: REST API, Core functionality, Idibon custom ML, ML persistence layer

Page 17: Data Day TX 2016 - Jan 16, 2016

How does a persistence layer enable us to use Spark?

• Real-time operationalization of many, many models
• Embed within different platforms
• Single save/load framework
• Rapidly incorporate new NLP features
• Logging/monitoring standardized & abstracted
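The slides do not show the persistence layer's design. One way to picture a "single save/load framework" is a registry that tags every serialized model with its kind, so that callers use the same two functions regardless of which library produced the model. Everything in this sketch (class names, the JSON format, the `LOADERS` registry) is hypothetical.

```python
import json
from typing import Any, Callable, Dict, List


class LinearModel:
    """Toy model type; a real layer would also wrap Spark and custom models."""
    kind = "linear"

    def __init__(self, weights: List[float]) -> None:
        self.weights = weights

    def to_payload(self) -> Dict[str, Any]:
        return {"weights": self.weights}

    @classmethod
    def from_payload(cls, payload: Dict[str, Any]) -> "LinearModel":
        return cls(payload["weights"])


# Registry mapping a model kind to the function that rebuilds it.
LOADERS: Dict[str, Callable[[Dict[str, Any]], Any]] = {
    LinearModel.kind: LinearModel.from_payload,
}


def save_model(model: Any) -> str:
    """Serialize any registered model into one common envelope."""
    return json.dumps({"kind": model.kind, "payload": model.to_payload()})


def load_model(blob: str) -> Any:
    """Rebuild a model by dispatching on the kind stored in the envelope."""
    record = json.loads(blob)
    return LOADERS[record["kind"]](record["payload"])


stored = save_model(LinearModel([0.2, -1.1, 0.9]))
restored = load_model(stored)
```

Adding a new model type then means registering one more loader, without changing the code that calls `save_model` and `load_model`.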

Page 18: Data Day TX 2016 - Jan 16, 2016

Summary

• Analyzing human language is hard
• We’re using exciting tools to build performant NLP systems that are faster & better than ever before
• Introduce yourself!

Page 19: Data Day TX 2016 - Jan 16, 2016

Questions?

Michelle Casbon

[email protected]

@texasmichelle