Meetup Paris NLP at Linkfluence, Season 3 #5, 22/05/2019

Speakers:
- Alexis Dutot, Linkfluence
- Benoît Lebreton, Sacha Samama and Tom Stringer, Quantmetry

MELUSINE: Open source library for email classification and feature extraction

Paris NLP, May 22, 2019

© Quantmetry 2019 | All Rights Reserved – Reproduction in whole or in part without written permission is prohibited

Speakers & contents

Sacha Samama: Python / DevOps enthusiast

Tom Stringer: NLP enthusiast

Benoît Lebreton: NLP enthusiast

3 Data Scientists


We are a team of Data Scientists, Engineers, Architects and Consultants dedicated to staying at the state of the art in NLP.

We mainly apply the state of the art to business problems, and we also contribute to the progress of the community.

Quantmetry leverages its knowledge of NLP in a brand new dedicated practice

Melusine: Email classification and feature extraction
https://github.com/MAIF/melusine

Grand Débat: Blueprint for summarizing “Grand Débat” contributions
https://github.com/Quantmetry/grand-debat

The project


Where it all began

15,000 emails received daily

• A significant and constantly increasing volume of incoming emails
• Complex email routing that increases handling time
• Emails hop repeatedly between business units before reaching the right one

2 objectives: Email Classification and Email Summarization

More than 100 advisors impacted


The data

2.8M emails sent between October 2017 and June 2018

655K emails in the project scope

112 words per email (median)

An example of input data

Bonjours,

Suite à notre conversation téléphonique de Mardi , pourriez vous me dire la somme que je vous dois afin d'être en régularisation.

Merci bonne journée

Le mar. 22 mai 2018 à 10:20, <[email protected]> a écrit :

Bonjour.

Merci de bien vouloir prendre connaissance du document ci-joint :

1 - Relevé d'identité postal (contrats)

Cordialement.

La Mututelle Imaginaire

La visualisation des fichiers PDF nécessite Adobe Reader.

Preprocessing



Preprocessing best practices: ordered pipeline

Structure the pipeline and, if possible, enforce a strict order of steps:

Split conversation → Parts tagging → Cleaning → Phraser → Tokenizer → Embeddings (Word2Vec)

Tags shown on the example email during parts tagging: HELLO, GREETINGS, SIGNATURE, FOOTERS

Text of each message after cleaning and phraser:

suite a notre conversation_telephonique de mardi pourriez_vous me dire la somme que je vous dois afin d'être en régularisation

merci de bien vouloir prendre connaissance du document ci_joint releve d’identite postal (contrats)

Output of the tokenizer:

[‘suite’, ‘a’, ‘notre’, ‘conversation_telephonique’, ‘de’, ‘mardi’, ‘pourriez_vous’, ‘me’, ‘dire’, ‘la’, ‘somme’, ‘que’, ‘je’, ‘vous’, ‘dois’, ‘afin’, ‘d’, ‘être’, ‘en’, ‘regularisation’]

[‘merci’, ‘de’, ‘bien’, ‘vouloir’, ‘prendre’, ‘connaissance’, ‘du’, ‘document’, ‘ci_joint’, ‘releve’, ‘d’, ‘identite’, ‘postal’, ‘contrats’]
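A strict step order can be enforced by declaring the steps once, in a single ordered list, and always running them through one entry point. The following is a minimal sketch under simplifying assumptions: the function names, the reply-marker regular expression and the hard-coded phrase list are all hypothetical stand-ins, parts tagging is omitted, and none of this is Melusine's actual implementation.

```python
import re
import unicodedata

# Hypothetical marker of a quoted previous message in a French reply.
REPLY_MARKER = re.compile(r"\n\s*Le .{0,80}? a écrit :")

# Hypothetical multi-word expressions; a real phraser would learn them.
PHRASES = ["conversation telephonique", "pourriez vous"]


def split_conversation(text):
    """Keep only the most recent message of the thread."""
    return REPLY_MARKER.split(text, maxsplit=1)[0]


def clean(text):
    """Lowercase, strip accents, drop punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9_' ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def apply_phrases(text):
    """Join known multi-word expressions with an underscore."""
    for phrase in PHRASES:
        text = text.replace(phrase, phrase.replace(" ", "_"))
    return text


def tokenize(text):
    """Split on whitespace and apostrophes."""
    return [tok for tok in re.split(r"[\s']+", text) if tok]


# The order is fixed in one place; every caller goes through run_pipeline.
PIPELINE = [split_conversation, clean, apply_phrases, tokenize]


def run_pipeline(text):
    for step in PIPELINE:
        text = step(text)
    return text


sample = (
    "Bonjours,\n\nSuite à notre conversation téléphonique de Mardi , "
    "pourriez vous me dire la somme que je vous dois.\n\n"
    "Le mar. 22 mai 2018 à 10:20, <client@example.com> a écrit :\n\nBonjour."
)
tokens = run_pipeline(sample)
```

Because the order lives in `PIPELINE`, a step cannot be skipped or run out of order by accident, and training and prediction code get the same preprocessing.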


Preprocessing best practices: Interoperability

Unified vocabulary (NLTK, gensim, keras, spacy, …)
• Don’t build several vocabularies (one for keras and another one for gensim)
• Let gensim build the vocabulary
• Consider removing stopwords

Unified tokenizer
• « I’m 24 » with an NLTK tokenizer: « I’ », « m », « 24 »
• « I’m 24 » with the keras tokenizer: « I’m », « 24 »

• Be careful about which tokenizer was used to build a public word embedding
• Build the gensim embedding not on a stream of raw texts but on a set of already tokenized texts (output of 4)
• Finally, train the neural network on the same already tokenized texts
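One way to keep the tokenizer and the vocabulary unified is to tokenize once, build a single token-to-index mapping from the tokenized texts, and reuse both everywhere. The sketch below uses only the standard library. `tokenize`, `build_vocabulary` and `encode` are illustrative names, and in the setup described above gensim would build the vocabulary instead.

```python
import re
from collections import Counter

PAD, UNK = 0, 1  # reserved indices for padding and unknown tokens


def tokenize(text):
    """The single tokenizer shared by every downstream component."""
    return re.findall(r"\w+", text.lower())


def build_vocabulary(tokenized_texts, min_count=1):
    """Map each token to one integer index, most frequent first."""
    counts = Counter(tok for tokens in tokenized_texts for tok in tokens)
    kept = [tok for tok, n in counts.most_common() if n >= min_count]
    return {tok: idx for idx, tok in enumerate(kept, start=2)}


def encode(tokens, vocabulary, max_len):
    """Turn tokens into a fixed-length index sequence for the network."""
    ids = [vocabulary.get(tok, UNK) for tok in tokens[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


texts = [
    "Merci de bien vouloir prendre connaissance du document",
    "Merci bonne journée",
]
tokenized = [tokenize(t) for t in texts]   # tokenize once
vocabulary = build_vocabulary(tokenized)   # one vocabulary for everything
encoded = [encode(toks, vocabulary, max_len=5) for toks in tokenized]
```

The embedding model and the classifier then both consume `tokenized` (or `encoded`), so a token can never be split one way during embedding training and another way during classification.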

Email Classification



Modeling: Benchmark and Performances

655,070 emails, 28 classes, 10% of the dataset used as test set

Model | Training time | Epochs | Accuracy | Prediction time | Size (RAM) | Hardware
TF-IDF + ExtraTree | 3h01 | - | 73.4% | - | 1.2 GB | 20 CPUs
Weight Embedding + XGB | 6h39 | - | 70.3% | - | 30 MB | 20 CPUs
RNN | 35min | 15 | 80.1% | > 200 ms | 64 MB | 1 GPU
CNN | 35min | 15 | 80.2% | < 200 ms | 106 MB | 1 GPU
Voting classifier (hard) | 10h50 | - | 80.2% | - | - | -

Deep learning models outperform the others
• Designed to be trained on large amounts of data
• Can learn from the position of each word within the email

Upper limit to the accuracy
• Labels are sometimes ambiguous
• Labels are sometimes incorrect

The CNN offers the best overall performance: accuracy + prediction time + size
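Hard voting, as in the "Voting classifier (hard)" entry of the benchmark, keeps the label predicted by the majority of the individual models. Here is a minimal sketch with made-up predictions. The class labels, the three-model setup and the tie-breaking rule are assumptions for illustration, not details of the benchmark.

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """Return the majority label for each email.

    predictions_per_model: one list of labels per model, all the same length.
    On a tie, the label that appears first among the models' votes wins.
    """
    voted = []
    for votes in zip(*predictions_per_model):
        voted.append(Counter(votes).most_common(1)[0][0])
    return voted


def accuracy(predicted, expected):
    """Share of emails whose predicted label matches the expected one."""
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits / len(expected)


# Made-up labels for four emails and three models.
tfidf_preds = ["claim", "contract", "claim", "billing"]
rnn_preds = ["claim", "billing", "contract", "billing"]
cnn_preds = ["contract", "billing", "claim", "billing"]

final = hard_vote([tfidf_preds, rnn_preds, cnn_preds])
score = accuracy(final, ["claim", "billing", "contract", "billing"])
```

Since every model has to be trained before the vote, the ensemble's training time is the sum of its members' training times, which is consistent with the 10h50 reported in the table.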


Focus on the selected model: CNN

CNN as the staple for Computer Vision
• Lower-level convolution layers learn local patterns
• Higher-level convolution layers learn global patterns

CNN for text classification
• Counter-intuitively, the CNN performs well on text classification tasks!
• Close words should be semantically related

Metadata input is joined to the processed text at the dense layers level to increase performance: +5 pt on the accuracy by adding metadata.
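Joining metadata at the dense-layer level can be pictured as a forward pass: convolution filters turn the token embeddings into text features, and the metadata vector is concatenated to those features just before the fully connected layer. The dependency-free sketch below only illustrates that data flow. The layer sizes, weights and function names are assumptions, not the architecture of the benchmarked CNN, and a real model would be built and trained with a deep learning library.

```python
import math


def conv1d_max(embedded, kernel):
    """Slide one convolution filter over the token sequence, apply ReLU,
    then keep the strongest response (global max pooling)."""
    width = len(kernel)
    best = 0.0  # ReLU floor
    for start in range(len(embedded) - width + 1):
        window = embedded[start:start + width]
        response = sum(
            w * x
            for kernel_row, token_vec in zip(kernel, window)
            for w, x in zip(kernel_row, token_vec)
        )
        best = max(best, response)
    return best


def dense_softmax(features, weights, biases):
    """Fully connected layer followed by a softmax over the classes."""
    logits = [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(embedded, metadata, kernels, weights, biases):
    text_features = [conv1d_max(embedded, k) for k in kernels]
    # Metadata joins the text features only here, just before the dense layer.
    joined = text_features + metadata
    return dense_softmax(joined, weights, biases)


# Toy input: 4 tokens with 2-dimensional embeddings, 2 metadata values.
embedded = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
metadata = [1.0, 0.0]
kernels = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[-1.0, 0.0], [0.0, -1.0]],
]
weights = [[0.5, 0.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0]]
biases = [0.0, 0.0]

probabilities = forward(embedded, metadata, kernels, weights, biases)
```

In this toy setup the second class is driven entirely by the first metadata value, which shows how metadata can shift the decision without passing through the convolution layers.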

Email Summarization



Automatic summary

1. Keywords identification: with the words' IDF scores

2. Reconstruction of the syntactic structure of the sentence

3. Reading through the syntactic graph: the graph is traversed depending on the types of links. Only nominal and verbal groups are kept.

4. Cleaning and post-processing: cleaning of the resulting description

Example input:

bonjour, j'aurais, comme chaque année, besoin d'une attestation d'assurance concernant l'appartement dont je suis locataire, 11a rue d’anjou, flag_cp_ marseille, pour 2018. par avance merci, meilleurs sentiments claude xxx,

Resulting summary:

Besoin d'une attestation d'assurance concernant l'appartement marseille, pour 2018.
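The keyword identification step relies on IDF scores: a word that appears in few emails gets a high score, while a word present in almost every email (such as a greeting) scores close to zero. The sketch below assumes the plain formula idf(t) = log(N / df(t)) without smoothing and uses a made-up three-email corpus. The function names are illustrative and the syntactic steps of the summarizer are not covered.

```python
import math
import re


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def idf_scores(documents):
    """idf(t) = log(N / df(t)), where df counts the documents containing t."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for token in set(tokenize(doc)):
            doc_freq[token] = doc_freq.get(token, 0) + 1
    return {tok: math.log(n_docs / df) for tok, df in doc_freq.items()}


def keywords(text, idf, top_n=3):
    """Rank the distinct tokens of one email by IDF, rarest first."""
    # Tokens never seen in the corpus are skipped rather than guessed.
    seen = dict.fromkeys(t for t in tokenize(text) if t in idf)
    return sorted(seen, key=lambda t: idf[t], reverse=True)[:top_n]


corpus = [
    "bonjour besoin attestation assurance",
    "bonjour merci pour le contrat",
    "bonjour merci attestation",
]
idf = idf_scores(corpus)
top = keywords("bonjour besoin attestation assurance", idf, top_n=2)
```

Here "bonjour" appears in every email and gets a score of zero, so it never surfaces as a keyword.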

Industrialization


Code architecture: a single pipeline for training and prediction

FIT: Labelled training set → Fit & Transform → Fit & Transform → Fit

PREDICT: New emails → Transform → Transform → Predict → Email label
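The single-pipeline idea is that training and prediction go through exactly the same transformers in the same order: during FIT each step is fitted then applied, during PREDICT each step is only applied. A minimal sketch follows, with hypothetical toy classes (`Lowercaser`, `KeywordClassifier`, `Pipeline`). scikit-learn's own `Pipeline` follows the same fit / transform / predict convention.

```python
from collections import Counter, defaultdict


class Lowercaser:
    """Stateless transformer: fit learns nothing."""

    def fit(self, texts, labels=None):
        return self

    def transform(self, texts):
        return [t.lower() for t in texts]


class KeywordClassifier:
    """Toy model: each word votes for the label it was most often seen with."""

    def fit(self, texts, labels):
        counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            for word in text.split():
                counts[word][label] += 1
        self.word_label_ = {
            w: c.most_common(1)[0][0] for w, c in counts.items()
        }
        self.default_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, texts):
        out = []
        for text in texts:
            votes = Counter(
                self.word_label_[w] for w in text.split()
                if w in self.word_label_
            )
            out.append(votes.most_common(1)[0][0] if votes else self.default_)
        return out


class Pipeline:
    """Runs the same transformers, in the same order, for fit and predict."""

    def __init__(self, transformers, model):
        self.transformers = transformers
        self.model = model

    def fit(self, texts, labels):
        for step in self.transformers:
            texts = step.fit(texts, labels).transform(texts)  # fit & transform
        self.model.fit(texts, labels)
        return self

    def predict(self, texts):
        for step in self.transformers:
            texts = step.transform(texts)  # transform only
        return self.model.predict(texts)


pipeline = Pipeline([Lowercaser()], KeywordClassifier())
pipeline.fit(
    ["Attestation assurance", "Facture impayée"],
    ["certificate", "billing"],
)
labels = pipeline.predict(["ATTESTATION pour 2018", "FACTURE"])
```

Because prediction reuses the fitted object, the preprocessing applied to new emails cannot drift from the one used at training time.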

Infrastructure

Domain

Application

Interface

Training (MAIF Datalab server)
• Infrastructure: Datalab servers. High computing power is necessary for training, with GPUs for deep learning.
• Life cycle: re-training is neither automated nor frequent, and is independent of the model life cycle.

Production (MAIF legacy server)
• Infrastructure: « MAIF » servers. A lighter architecture is necessary for prediction.
• Life cycle: the model should be stable and able to process up to 10,000 posts daily.


Deploying the model: the tools for a continuous deployment pipeline

Git workflow: git push to the dev branch, then git merge into the preprod branch.

1. Synchronizing the code on a specific branch triggers Jenkins to deploy the API.

2. Docker containers that will host the project are built. The corresponding files and the necessary data are selected.

3. The Docker containers are deployed on the Rancher hosting platform. If a container is already active, it is replaced without any service interruption.

4. Interaction with the model goes through an API.

An open source project


Melusine

Melusine is an open source library developed by Quantmetry and MAIF.

It is a high-level Python library for email classification and feature extraction, capable of running on top of Scikit-Learn, Keras or Tensorflow.

It is developed with a focus on emails written in French!

• Free software: Apache Software License 2.0
• Documentation: https://melusine.readthedocs.io



A brief introduction - 5 sub packages

• prepare_email: mail processing, the core block
• summarizer: keywords extraction
• models: classification of emails according to labels defined by the user
• nlp_tools: standard tools for automatic text analysis (tokenizer, phraser, embeddings...)
• utils: contains the TransformerScheduler, a class to build your own transformation pipeline and integrate it into a scikit-learn pipeline


Principle of the TransformerScheduler class

1. Orchestrates functions that apply on rows of a DataFrame and can be easily integrated into a Scikit-Learn pipeline

2. Provides Multiprocessing mode

3. Allows custom functions to be integrated into the transformation pipeline
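The first and third points can be illustrated with a small stand-in class that applies row-level functions in a declared order and exposes `fit` / `transform` so that it can sit inside a pipeline. This is a hypothetical sketch, not Melusine's TransformerScheduler: the class name, the `steps` format and the helper functions are invented for illustration, rows are plain dictionaries rather than DataFrame rows, and the multiprocessing mode is left out.

```python
class RowFunctionScheduler:
    """Hypothetical stand-in: applies row-level functions in a fixed order.

    Each step is (function, keyword_arguments, output_column). Every
    function receives the row (a dict) and returns the new column value.
    """

    def __init__(self, steps):
        self.steps = steps

    def fit(self, rows, labels=None):
        return self  # nothing to learn; present for pipeline compatibility

    def transform(self, rows):
        out = [dict(row) for row in rows]  # leave the input untouched
        for function, kwargs, column in self.steps:
            for row in out:
                row[column] = function(row, **kwargs)
        return out


def strip_greeting(row, greetings=("bonjour", "bonsoir")):
    """Custom function: drop a leading greeting word from the body."""
    words = row["body"].split()
    if words and words[0].lower().strip(",.!") in greetings:
        words = words[1:]
    return " ".join(words)


def word_count(row, column="clean_body"):
    """Custom function: count the words of a column."""
    return len(row[column].split())


scheduler = RowFunctionScheduler([
    (strip_greeting, {}, "clean_body"),
    (word_count, {"column": "clean_body"}, "n_words"),
])
emails = [{"body": "Bonjour, merci pour le document"}]
rows = scheduler.fit(emails).transform(emails)
```

Adding a business-specific rule then means writing one more function and appending one more step, without touching the scheduler itself.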


How to remove business specificity without breaking the code structure


How to handle specific context and domain knowledge

A config module has been implemented so that users can set up their own parameters:
• Define the path to the conf file that the user wants to use
• Ability to restore the path to the default conf file

This JSON file is the core of the package, since it is used by different submodules to preprocess the data. It manages:
• Custom regular expressions
• Custom keywords
• Custom stopwords

You can modify the conf.json file according to your specific needs!

[Figures: extract from the documentation; extract from the default conf.json]
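The mechanism described above (point to a custom conf file, or restore the default one) can be sketched as follows. Everything here is an assumption made for illustration: the `Config` class, its method names and the JSON keys are hypothetical and do not reproduce Melusine's config module or the contents of its default conf.json.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical defaults; the real conf.json keys are not reproduced here.
DEFAULT_CONF = {
    "regex": {"reply_header": r"^Le .* a écrit :$"},
    "keywords": ["attestation"],
    "stopwords": ["le", "la", "de"],
}


class Config:
    """Loads parameters from a user file, with a way back to the defaults."""

    def __init__(self):
        self.values = dict(DEFAULT_CONF)
        self.path = None

    def set_path(self, path):
        """Use the conf file at `path`; keys it omits keep their default."""
        user_conf = json.loads(Path(path).read_text(encoding="utf-8"))
        self.values = {**DEFAULT_CONF, **user_conf}
        self.path = Path(path)

    def reset(self):
        """Restore the default configuration."""
        self.values = dict(DEFAULT_CONF)
        self.path = None


def remove_stopwords(tokens, config):
    """Example of a preprocessing step that reads its rules from the config."""
    stopwords = set(config.values["stopwords"])
    return [tok for tok in tokens if tok not in stopwords]


config = Config()
with tempfile.TemporaryDirectory() as folder:
    custom = Path(folder) / "conf.json"
    custom.write_text(json.dumps({"stopwords": ["merci"]}), encoding="utf-8")
    config.set_path(custom)

custom_result = remove_stopwords(["merci", "de", "votre", "aide"], config)
config.reset()
default_result = remove_stopwords(["merci", "de", "votre", "aide"], config)
```

Keeping business rules in a data file rather than in the code is what lets the same package serve another company's emails without changing the code structure.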


Open source a project: best practices

Phases covered: before, during and after the development phase.

• Before the development phase: clearly define all the functionalities and sub-functionalities of your package, and write the pseudo-code / structure.
• Give particular importance to documentation (API and package), with plenty of examples and well-documented code.


Feel free to contribute or simply test it!

> pip install melusine

Melusine 1.6.0 is ready to use and compatible with Python >= 3.5

Appendix
