Data Day TX 2016 - Jan 16, 2016

Michelle Casbon

January 16, 2016 – Data Day Texas, Austin

Under the Hood of Idibon’s Scalable NLP Services

What do we do?

• Idibon creates adaptive machine intelligence that can analyze text in any language

natural language text, social media → structured insights

Agenda

• Background
• Process walk-through
• Platform description
• Why we chose Spark
• How we’re using Spark ML & MLlib
• Challenges of adopting Spark in a distributed NLP system

What are our use cases?

• Supply Chain Risk
• Intent to purchase
• Global health trends
• Interactive Voice Response
• Multilingual news
• SMS Prioritization
• Change reception

How do we do it?

Adaptive learning

• Fewer annotations
• Lower costs
• Less time spent training
• Higher accuracy
• Improves over time

unlabeled pool → intelligent queuing & human annotation → labeled training set → machine learning
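The adaptive learning loop on this slide can be sketched roughly as follows. This is only an illustration: it assumes uncertainty sampling as the queuing strategy, which the slides do not confirm, and `queue_for_annotation`, `toy_predict_proba`, and the keyword set are hypothetical names invented for the example.

```python
from typing import Callable, List


def queue_for_annotation(
    unlabeled_pool: List[str],
    predict_proba: Callable[[str], float],
    batch_size: int,
) -> List[str]:
    """Return the documents the current model is least sure about."""
    # Uncertainty is measured as distance from 0.5; smaller means less certain.
    ranked = sorted(unlabeled_pool, key=lambda doc: abs(predict_proba(doc) - 0.5))
    return ranked[:batch_size]


def toy_predict_proba(doc: str) -> float:
    """Hypothetical stand-in model: fraction of words found in a positive list."""
    positive = {"great", "love", "good"}
    words = doc.lower().split()
    if not words:
        return 0.5
    return sum(word in positive for word in words) / len(words)


pool = [
    "love love love",
    "great phone bad battery",
    "terrible awful",
    "good enough maybe not",
]
# The two mixed-signal documents are sent to human annotators first.
to_annotate = queue_for_annotation(pool, toy_predict_proba, batch_size=2)
```

Once annotated, the selected documents would move from the unlabeled pool into the labeled training set and the model would be retrained, which is how the loop could reduce the number of annotations needed.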

How do we do it?

Dataset, Models

1. Goal Definition
2. Identification
3. Cleansing
4. Training data creation
5. Quality Control
6. Creation
7. Hyperparameter Tuning
8. Intelligent Queueing
9. Rule Creation
10. Prediction (Unseen Data)

Scalability Challenges

• Real-time API support
• Document storage
• 1000’s of individual predictions per second
• Continuous training
• Hyperparameter optimization

What does our platform look like?

Why are we using Spark?

• Wide variety of algorithms
• Active development
• Open source
• Industry-standard algorithm implementation
• Intended for use in enterprise applications
• Scalability

How are we using Spark?

• Feature Extraction
  • TF-IDF
  • Word2Vec
  • Dimensionality reduction
• Training
  • Logistic Regression
  • SVM
  • Naïve Bayes
  • LDA
• Prediction
  • Evaluation metrics

Feature Extraction → Training → Prediction

[1.0, [1.0, 0.0, 3.0]]

Feature Extraction

Extract Content → Tokenize → Bigrams → Trigrams → Feature Lookup → Vector [1.0, 0.0, 3.0]
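The feature extraction steps on this slide can be sketched in plain Python. This is a simplified illustration, not Idibon's or Spark's implementation: the whitespace tokenizer, the choice to count single tokens alongside bigrams and trigrams, and the `feature_index` contents are all assumptions made for the example.

```python
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Join every run of n consecutive tokens into one feature string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_features(text: str, feature_index: Dict[str, int]) -> List[float]:
    """Count tokens, bigrams, and trigrams that appear in the feature index."""
    tokens = tokenize(text)
    candidates = tokens + ngrams(tokens, 2) + ngrams(tokens, 3)
    vector = [0.0] * len(feature_index)
    for feature in candidates:
        position = feature_index.get(feature)  # feature lookup
        if position is not None:
            vector[position] += 1.0
    return vector


# Hypothetical feature index: feature string -> position in the vector.
feature_index = {"spark": 0, "is slow": 1, "so so so": 2}
vector = extract_features("Spark so so so so so good", feature_index)
```

With this made-up index, the example text yields the same vector shown on the slide, `[1.0, 0.0, 3.0]`: one hit for `spark`, none for `is slow`, and three overlapping occurrences of the trigram `so so so`.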

Training

Vector [1.0, 0.0, 3.0] → Add classification → LabeledPoint [1.0, [1.0, 0.0, 3.0]]

LabeledPoint [1.0, [1.0, 0.0, 3.0]] → LogisticRegressionWithLBFGS → LogisticRegressionModel → Model Storage
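To make the training step concrete, here is a dependency-free sketch of fitting a logistic regression model on labeled points shaped like `[1.0, [1.0, 0.0, 3.0]]`. It is not Spark's `LogisticRegressionWithLBFGS`: plain batch gradient descent is swapped in for L-BFGS to keep the example short, and the training data, `train_logistic_regression`, and `predict_proba` are hypothetical.

```python
import math
from typing import List, Tuple

# (label, features), mirroring the slide's [1.0, [1.0, 0.0, 3.0]]
LabeledPoint = Tuple[float, List[float]]


def predict_proba(weights: List[float], intercept: float, features: List[float]) -> float:
    """Logistic function applied to the linear score."""
    score = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))


def train_logistic_regression(
    points: List[LabeledPoint], iterations: int = 500, learning_rate: float = 0.1
) -> Tuple[List[float], float]:
    """Fit weights and intercept with batch gradient descent on the log loss."""
    size = len(points[0][1])
    weights, intercept = [0.0] * size, 0.0
    for _ in range(iterations):
        grad_w, grad_b = [0.0] * size, 0.0
        for label, features in points:
            error = predict_proba(weights, intercept, features) - label
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        # Average the gradient over the batch before stepping.
        weights = [w - learning_rate * g / len(points) for w, g in zip(weights, grad_w)]
        intercept -= learning_rate * grad_b / len(points)
    return weights, intercept


# Made-up training set: label 1.0 for the first two vectors, 0.0 for the rest.
points = [
    (1.0, [1.0, 0.0, 3.0]),
    (1.0, [2.0, 0.0, 2.0]),
    (0.0, [0.0, 2.0, 0.0]),
    (0.0, [0.0, 1.0, 0.0]),
]
weights, intercept = train_logistic_regression(points)
```

The returned weights and intercept are what a model storage step would need to persist so that prediction can later run without retraining.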

Prediction

New tweet → Extract Content → Tokenize → Bigrams → Trigrams → Feature Lookup → Vector [0.0, 1.0, 4.0]

Vector [0.0, 1.0, 4.0] → Model Lookup → Predict → Classification Lookup
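The prediction path (model lookup, predict, classification lookup) might look roughly like the sketch below. The in-memory stores, the task name `sentiment`, the weights, and the class labels are all invented for illustration; the slides do not describe how Idibon stores models or classifications.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical model store: task name -> (weights, intercept).
MODEL_STORE: Dict[str, Tuple[List[float], float]] = {
    "sentiment": ([0.5, -1.2, 0.8], -0.3),
}

# Hypothetical classification lookup: numeric class -> readable label.
CLASSIFICATIONS: Dict[str, Dict[float, str]] = {
    "sentiment": {0.0: "negative", 1.0: "positive"},
}


def predict(task: str, vector: List[float], threshold: float = 0.5) -> str:
    """Score one feature vector and translate the class into its label."""
    weights, intercept = MODEL_STORE[task]  # model lookup
    score = intercept + sum(w * x for w, x in zip(weights, vector))
    probability = 1.0 / (1.0 + math.exp(-score))  # logistic function
    predicted_class = 1.0 if probability > threshold else 0.0
    return CLASSIFICATIONS[task][predicted_class]  # classification lookup


# Feature vector from the slide, as produced by feature extraction on a new tweet.
label = predict("sentiment", [0.0, 1.0, 4.0])
```

Scoring a single vector this way involves only a dot product and one exponential, which is the kind of lightweight per-document work an online prediction path needs.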

How do we provide online predictions with Spark?

DataFrames are slow ... if you have small data

Task                  Time in µs
Vector prediction     300
DataFrame prediction  7800

How do we fit Spark into our existing system?

Core functionality

Idibon custom ML

…

REST API

ML persistence layer

How does a persistence layer enable us to use Spark?

• Real-time operationalization of many, many models
• Embed within different platforms
• Single save/load framework
• Rapidly incorporate new NLP features
• Logging/monitoring standardized & abstracted
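One way to picture a "single save/load framework" is a small registry that routes every model type through the same two entry points. The sketch below is purely hypothetical: `register`, `save`, `load`, the JSON envelope, and the `LinearModel` class are assumptions for illustration and do not reflect Idibon's actual persistence layer or Spark's own model persistence API.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class LinearModel:
    """Hypothetical model type holding learned parameters."""
    weights: List[float]
    intercept: float


# Registry: model type name -> (serialize to dict, rebuild from dict).
_CODECS: Dict[str, Tuple[Callable[[Any], dict], Callable[[dict], Any]]] = {}


def register(model_type: str, to_dict: Callable[[Any], dict],
             from_dict: Callable[[dict], Any]) -> None:
    """Teach the persistence layer how to handle one model type."""
    _CODECS[model_type] = (to_dict, from_dict)


def save(model_type: str, model: Any) -> str:
    """Serialize any registered model into one common envelope."""
    to_dict, _ = _CODECS[model_type]
    return json.dumps({"type": model_type, "payload": to_dict(model)})


def load(blob: str) -> Any:
    """Rebuild a model by dispatching on the type stored in the envelope."""
    record = json.loads(blob)
    _, from_dict = _CODECS[record["type"]]
    return from_dict(record["payload"])


register("linear", asdict, lambda payload: LinearModel(**payload))

blob = save("linear", LinearModel(weights=[0.5, -1.2, 0.8], intercept=-0.3))
restored = load(blob)
```

Under this design, supporting a new model family means registering one more pair of functions, while callers such as a REST API keep using the same `save` and `load` calls.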

Summary

• Analyzing human language is hard
• We’re using exciting tools to build performant NLP systems that are faster & better than ever before
• Introduce yourself!

Questions?

Michelle Casbon

[email protected]

@texasmichelle

Engineering
