Meetup Paris NLP at Linkfluence Season 3 #5 22/05/2019
Speakers: Alexis Dutot (Linkfluence); Benoît Lebreton, Sacha Samama and Tom Stringer (Quantmetry)
MELUSINE: open source library for email classification and feature extraction
Paris NLP, May 22, 2019
© Quantmetry 2019 | All Rights Reserved – Reproduction in whole or in part without written permission is prohibited 2
Speakers & contents
3 Data Scientists
• Sacha Samama: Python / DevOps enthusiast
• Tom Stringer: NLP enthusiast
• Benoît Lebreton: NLP enthusiast
We are a team of Data Scientists, Engineers, Architects and Consultants dedicated to keeping up with the state of the art in NLP.
We mainly apply the state of the art to business problems, and we also contribute to the progress of the community.
Quantmetry leverages its knowledge of NLP in a brand new dedicated practice
• Melusine: email classification and feature extraction
• Grand Débat: blueprint for summarizing “Grand Débat” contributions
The project
Where it all started
15,000 emails a day
• A significant and constantly increasing volume of incoming emails
• Complex email routing that increased handling time
• Emails frequently bounced from one business unit to another
More than 100 advisors impacted
2 objectives: email classification and email summarization
The data
• 2.8M emails sent between October 2017 and June 2018
• 655K emails in the project scope
• 112 words per email (median)
An example of input data
Bonjours,
Suite à notre conversation téléphonique de Mardi , pourriez vous me dire la somme que je vous dois afin d'être en régularisation.
Merci bonne journée
Le mar. 22 mai 2018 à 10:20, <[email protected]> a écrit :
Bonjour.Merci de bien vouloir prendre connaissance du document ci-joint :1 - Relevé d'identité postal (contrats)Cordialement.
La Mututelle Imaginaire
La visualisation des fichiers PDF nécessite Adobe Reader.
Preprocessing
Preprocessing best practices: ordered pipeline
Structure the pipeline and, where possible, enforce a strict order between its steps.
Pipeline steps shown in the diagram: Split conversation, Parts tagging, Cleaning, Phraser, Tokenizer, Embeddings (Word2Vec)

Parts tagging applied to the example email above: HELLO, GREETINGS, SIGNATURE, FOOTERS

Text after cleaning and phraser:
suite a notre conversation_telephonique de mardi pourriez_vous me dire la somme que je vous dois afin d'être en régularisation
merci de bien vouloir prendre connaissance du document ci_joint releve d’identite postal (contrats)

Tokenizer output:
['suite', 'a', 'notre', 'conversation_telephonique', 'de', 'mardi', 'pourriez_vous', 'me', 'dire', 'la', 'somme', 'que', 'je', 'vous', 'dois', 'afin', 'd', 'être', 'en', 'regularisation']
['merci', 'de', 'bien', 'vouloir', 'prendre', 'connaissance', 'du', 'document', 'ci_joint', 'releve', 'd', 'identite', 'postal', 'contrats']
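Why the order of the steps matters can be shown with a minimal sketch that uses only the standard library. The regular expression for the French reply header, the function names and the email address are assumptions made for illustration; they are not Melusine's implementation.

```python
import re
from typing import List

# Assumed pattern for the reply header seen in the example email
# ("Le mar. 22 mai 2018 à 10:20, <...> a écrit :"); real mail clients vary.
REPLY_HEADER = re.compile(r"^Le .{1,80}? a écrit :\s*$", re.MULTILINE)


def split_conversation(raw_email: str) -> List[str]:
    """Split a raw email into its messages, most recent first."""
    parts = REPLY_HEADER.split(raw_email)
    return [part.strip() for part in parts if part.strip()]


def clean(text: str) -> str:
    """Lowercase, drop punctuation and collapse whitespace (accents are kept)."""
    text = re.sub(r"[^\w\s'-]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def clean_all(messages: List[str]) -> List[str]:
    return [clean(message) for message in messages]


# Strict order: cleaning first would strip the "<", ">" and ":" characters
# and lowercase the header that split_conversation relies on.
PIPELINE = [split_conversation, clean_all]

raw = (
    "Bonjours,\nSuite à notre conversation téléphonique de Mardi.\n"
    "Le mar. 22 mai 2018 à 10:20, <conseiller@example.com> a écrit :\n"
    "Bonjour. Merci de bien vouloir prendre connaissance du document ci-joint."
)

result = raw
for step in PIPELINE:
    result = step(result)
print(result)
```

Swapping the two steps would leave a single message, because the cleaned text no longer contains a recognizable reply header.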
Preprocessing best practices: interoperability

Unified vocabulary (NLTK, gensim, keras, spacy, …)
• Don’t build several vocabularies (one for keras and another one for gensim)
• Let gensim build the vocabulary
• Consider removing stopwords

Unified tokenizer
• « I’m 24 » with an NLTK tokenizer: « I’ », « m », « 24 »
• « I’m 24 » with the keras tokenizer: « I’m », « 24 »

In practice
• Check which tokenizer was used to build any public word embedding you reuse
• Build the gensim embedding not on a stream of raw texts but on a set of already tokenized texts (output of step 4)
• Finally, train the neural network on the same already tokenized texts
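The "tokenize once, keep one vocabulary" advice can be sketched as follows with the standard library only. The regular expression tokenizer, the function names and the toy texts are assumptions for illustration; in the setup described on the slide, gensim would build the vocabulary and the same token lists would then feed the neural network.

```python
import re
from collections import Counter
from typing import Dict, List

# One tokenizer for every consumer; apostrophes and hyphens stay inside tokens.
TOKEN_PATTERN = re.compile(r"\w+(?:['-]\w+)*")


def tokenize(text: str) -> List[str]:
    """The single tokenizer shared by the embedding and the classifier."""
    return TOKEN_PATTERN.findall(text.lower())


def build_vocabulary(tokenized_texts: List[List[str]], min_count: int = 1) -> Dict[str, int]:
    """Map each kept token to an integer id; 0 is reserved for unknown words."""
    counts = Counter(token for tokens in tokenized_texts for token in tokens)
    kept = sorted(token for token, n in counts.items() if n >= min_count)
    return {token: index for index, token in enumerate(kept, start=1)}


def encode(tokens: List[str], vocabulary: Dict[str, int]) -> List[int]:
    """Turn tokens into ids using the shared vocabulary."""
    return [vocabulary.get(token, 0) for token in tokens]


texts = ["I'm 24", "I'm a customer"]
tokenized = [tokenize(text) for text in texts]  # tokenize once
vocabulary = build_vocabulary(tokenized)        # build one vocabulary
encoded = [encode(tokens, vocabulary) for tokens in tokenized]
print(tokenized)
print(vocabulary)
print(encoded)
```

Because every downstream component reads `tokenized`, a word such as "i'm" can never be split differently by the embedding and by the classifier.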
Email Classification
Modeling: benchmark and performance

Dataset: 655,070 emails, 28 classes, 10% of the dataset used as the test set

|                     | TF-IDF + ExtraTree | Weight Embedding + XGB | RNN     | CNN     | Voting classifier (hard) |
| Training time       | 3h01               | 6h39                   | 35min   | 35min   | 10h50                    |
| Number of epochs    | -                  | -                      | 15      | 15      | -                        |
| Accuracy            | 73.4%              | 70.3%                  | 80.1%   | 80.2%   | 80.2%                    |
| Prediction time     | -                  | -                      | > 200ms | < 200ms | -                        |
| Size (RAM)          | 1.2 GB             | 30 MB                  | 64 MB   | 106 MB  | -                        |
| Number of CPU / GPU | CPU x20            | CPU x20                | 1 GPU   | 1 GPU   | -                        |

Deep learning models perform best
• They are designed to be trained on large amounts of data
• They can learn from the position of each word within the email

Upper limit on accuracy
• Labels are sometimes debatable
• Labels are sometimes incorrect

The CNN offers the best overall performance: accuracy + prediction time + size
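For readers unfamiliar with the "hard" voting classifier in the benchmark, here is a minimal sketch of majority voting over the labels predicted by several models. The label names, the predictions and the tie-breaking rule are hypothetical; the slides do not say how ties were resolved.

```python
from collections import Counter
from typing import List, Sequence


def hard_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Majority vote per email; ties go to the earliest-listed model."""
    voted = []
    for labels in zip(*predictions):  # labels given to one email by each model
        counts = Counter(labels)
        best = max(counts.values())
        voted.append(next(label for label in labels if counts[label] == best))
    return voted


# Hypothetical labels predicted by three models for four emails.
cnn = ["contract", "claim", "billing", "claim"]
rnn = ["contract", "billing", "billing", "contract"]
tfidf_trees = ["claim", "billing", "claim", "billing"]

final_labels = hard_vote([cnn, rnn, tfidf_trees])
print(final_labels)
```

Hard voting only uses the predicted labels, not the class probabilities, which is consistent with it matching rather than beating the best single model in the table.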
Focus on the selected model: CNN
CNN as the staple of computer vision
• Lower-level convolution layers learn local patterns
• Higher-level convolution layers learn global patterns

CNN for text classification
• Counter-intuitively, the CNN performs well on text classification tasks!
• Words that are close to each other should be semantically related

Metadata inputs are joined to the processed text at the dense-layer level to increase performance: adding metadata gains 5 points of accuracy.
Email Summarization
Automatic summary
1. Keyword identification, using the words' IDF scores
2. Reconstruction of the syntactic structure of the sentence
3. Reading through the syntactic graph, following links according to their type. Only nominal and verbal groups are kept.
4. Cleaning and post-processing of the resulting description

Example input:
bonjour, j'aurais, comme chaque année, besoin d'une attestation d'assurance concernant l'appartement dont je suis locataire, 11a rue d’anjou, flag_cp_ marseille, pour 2018. par avance merci, meilleurs sentiments claude xxx,

Resulting summary:
Besoin d'une attestation d'assurance concernant l'appartement marseille, pour 2018.
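The first step, keyword identification from IDF scores, can be illustrated with a short standard-library sketch. The three-email corpus, the tokenizer and the alphabetical tie-break are assumptions; the syntactic steps that follow are not covered here.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    return re.findall(r"\w+", text.lower())


def idf_scores(corpus: List[str]) -> Dict[str, float]:
    """idf(w) = log(N / df(w)): words present in every email score 0."""
    document_frequency = Counter()
    for document in corpus:
        document_frequency.update(set(tokenize(document)))
    n_documents = len(corpus)
    return {w: math.log(n_documents / df) for w, df in document_frequency.items()}


def top_keywords(text: str, idf: Dict[str, float], k: int = 3) -> List[str]:
    """Return the k distinct words of `text` with the highest IDF (ties: alphabetical)."""
    words = set(tokenize(text))
    return sorted(words, key=lambda w: (-idf.get(w, 0.0), w))[:k]


# Toy corpus, invented for the example.
corpus = [
    "bonjour j'aurais besoin d'une attestation d'assurance merci",
    "bonjour merci de m'envoyer ma facture",
    "bonjour je souhaite résilier mon contrat merci",
]
idf = idf_scores(corpus)
keywords = top_keywords(corpus[0], idf, k=2)
print(keywords)
```

Greetings such as "bonjour" and "merci" appear in every email of the toy corpus, so their IDF is zero and they never rank as keywords.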
Industrialization
Code architecture: a single pipeline for training and prediction
Diagram: one pipeline, two modes
• FIT: labelled training set → preprocessing steps (fit & transform) → model (fit)
• PREDICT: new emails → preprocessing steps (transform) → model (predict) → email label
Code layers: Infrastructure, Domain, Application, Interface

Training (MAIF Datalab server)
• Infrastructure: Datalab servers. High computing power is necessary for training, with GPUs for deep learning.
• Life cycle: re-training is neither automated nor frequent, and is independent of the model life cycle.

Production (MAIF legacy server)
• Infrastructure: MAIF servers. A lighter architecture is necessary for prediction.
• Life cycle: the model should be stable and able to process up to 10,000 messages a day.
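The idea of a single pipeline object used for both training and prediction can be sketched as below. Every class here is a toy stand-in written for illustration (the keyword-voting model in particular is not the CNN from the benchmark), and the labels and emails are invented.

```python
from collections import Counter, defaultdict
from typing import Dict, List


class Lowercaser:
    """Stateless preprocessing step: fit learns nothing, transform tokenizes."""

    def fit(self, texts, labels=None):
        return self

    def transform(self, texts: List[str]) -> List[List[str]]:
        return [text.lower().split() for text in texts]


class KeywordClassifier:
    """Toy model: each known word votes for the label it was most seen with."""

    def fit(self, token_lists, labels):
        counts: Dict[str, Counter] = defaultdict(Counter)
        for tokens, label in zip(token_lists, labels):
            for token in tokens:
                counts[token][label] += 1
        self.word_label_ = {w: c.most_common(1)[0][0] for w, c in counts.items()}
        self.default_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, token_lists):
        predictions = []
        for tokens in token_lists:
            votes = Counter(self.word_label_[t] for t in tokens if t in self.word_label_)
            predictions.append(votes.most_common(1)[0][0] if votes else self.default_)
        return predictions


class EmailPipeline:
    """Same steps in the same order for FIT and PREDICT."""

    def __init__(self, steps, model):
        self.steps, self.model = steps, model

    def fit(self, texts, labels):
        data = texts
        for step in self.steps:
            data = step.fit(data, labels).transform(data)  # fit & transform
        self.model.fit(data, labels)
        return self

    def predict(self, texts):
        data = texts
        for step in self.steps:
            data = step.transform(data)  # transform only, nothing is re-learned
        return self.model.predict(data)


pipeline = EmailPipeline([Lowercaser()], KeywordClassifier())
pipeline.fit(
    ["Facture impayée", "Sinistre auto", "Facture en double"],
    ["billing", "claim", "billing"],
)
predicted = pipeline.predict(["Question sur ma FACTURE", "sinistre habitation"])
print(predicted)
```

Keeping one object for both modes means the production server only needs the fitted pipeline, and preprocessing cannot drift between training and prediction.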
Deploying the model: the tools for a continuous deployment pipeline
Git workflow shown in the diagram: a dev branch and a preprod branch, each updated by git push and linked by a git merge.
• Synchronizing the code on a specific branch triggers Jenkins to deploy the API.
• Docker containers that will host the project are built. The corresponding files and the necessary data are selected.
• The Docker containers are deployed on the Rancher hosting platform. If a container is already active, it is replaced without any service interruption.
• The model is then accessed through an API.
An open source project
Melusine is an open source library developed by Quantmetry and MAIF.
It is a high-level Python library for email classification and feature extraction, capable of running on top of Scikit-Learn, Keras or Tensorflow.
It is developed with a focus on emails written in French!
Melusine
• Free software: Apache Software License 2.0
• Documentation: https://melusine.readthedocs.io
• Source code: https://github.com/MAIF/melusine
A brief introduction - 5 sub packages
• prepare_email: email processing, the core block
• summarizer: keyword extraction
• models: email classification according to labels defined by the user
• nlp_tools: standard tools for automatic text analysis (tokenizer, phraser, embeddings...)
• utils: contains the TransformerScheduler, a class to build your own transformation pipeline and integrate it into a scikit-learn pipeline
Principle of the TransformerScheduler class
1. Orchestrates functions that apply on rows of a DataFrame and can be easily integrated into a Scikit-Learn pipeline
2. Provides a multiprocessing mode
3. Allows custom functions to be integrated into the transformation pipeline
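The three points above can be illustrated with a hypothetical stand-in. `RowFunctionScheduler` is not Melusine's TransformerScheduler and does not reproduce its signature: it works on a list of dictionaries instead of a DataFrame, and it uses a thread pool from `multiprocessing.dummy` rather than real processes so that the sketch runs anywhere.

```python
from multiprocessing.dummy import Pool  # thread pool, keeps the sketch portable
from typing import Callable, Dict, List, Sequence, Tuple

Row = Dict[str, object]
# (function applied to one row, name of the column that receives its result)
Step = Tuple[Callable[[Row], object], str]


class RowFunctionScheduler:
    """Hypothetical stand-in: applies row-wise functions in a fixed order."""

    def __init__(self, steps: Sequence[Step], n_jobs: int = 1):
        self.steps = list(steps)
        self.n_jobs = n_jobs

    def fit(self, rows: List[Row], y=None):
        return self  # nothing to learn; kept for a scikit-learn style interface

    def _apply_steps(self, row: Row) -> Row:
        row = dict(row)  # do not mutate the caller's data
        for function, column in self.steps:
            row[column] = function(row)
        return row

    def transform(self, rows: List[Row]) -> List[Row]:
        if self.n_jobs == 1:
            return [self._apply_steps(row) for row in rows]
        with Pool(self.n_jobs) as pool:
            return pool.map(self._apply_steps, rows)


def clean_body(row: Row) -> str:
    return str(row["body"]).strip().lower()


def count_words(row: Row) -> int:
    """Custom function; it relies on the column written by clean_body."""
    return len(str(row["clean_body"]).split())


scheduler = RowFunctionScheduler(
    [(clean_body, "clean_body"), (count_words, "n_words")], n_jobs=2
)
emails = [{"body": "  Bonjour, merci de votre retour  "}, {"body": "Cordialement"}]
processed = scheduler.fit(emails).transform(emails)
print(processed)
```

Because the object exposes `fit` and `transform`, the same scheduling idea can sit as one step inside a larger pipeline, with user-defined functions mixed freely with built-in ones.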
How to remove business specificity without breaking the code structure
How to handle specific context and domain knowledge
A config module has been implemented so that users can set their own parameters:
• Define the path to the conf file that the user wants to use
• Restore the path to the default conf file

The JSON conf file is the core of the package, since it is used by different submodules to preprocess the data. It manages:
• Custom regular expressions
• Custom keywords
• Custom stopwords

You can modify the conf.json file to fit your specific needs!
Extract from documentation
Extract from the default conf.json
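How a user configuration might override defaults without touching the code can be sketched as follows. The keys, values and the `merge_conf` helper are assumptions made for illustration; they are not the schema of Melusine's conf.json nor the API of its config module.

```python
import json
from typing import Any, Dict

# Illustrative defaults only: NOT the real conf.json schema.
DEFAULT_CONF: Dict[str, Any] = {
    "regex": {"greetings": r"^(bonjour|bonsoir)\b"},
    "keywords": ["attestation", "contrat"],
    "stopwords": ["le", "la", "de"],
}


def merge_conf(default: Dict[str, Any], custom: Dict[str, Any]) -> Dict[str, Any]:
    """Return `default` overridden by `custom`; nested dicts are merged key by key."""
    merged = dict(default)
    for key, value in custom.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_conf(merged[key], value)
        else:
            merged[key] = value
    return merged


# In practice the custom configuration would be read from a JSON file on disk.
custom_conf = json.loads(
    '{"stopwords": ["le", "la", "de", "maif"], "regex": {"signature": "^cordialement"}}'
)
conf = merge_conf(DEFAULT_CONF, custom_conf)
print(json.dumps(conf, indent=2, ensure_ascii=False))
```

Business-specific vocabulary lives entirely in the custom file, so restoring the default behavior only requires dropping the override.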
Open source a project: best practices

Before the development phase
• Clearly define all the functionalities and sub-functionalities of your package and write the pseudo-code / structure.

During the development phase

After the development phase

Give particular importance to documentation (API and package), with plenty of examples and well-documented code.
Feel free to contribute or simply test it!
> pip install melusine
Melusine 1.6.0 is ready to use and compatible with Python >= 3.5
Appendix