Pycon2015 scope

===================
PyCon Talk Proposal
===================

:Title: Building a Large Scale Prediction System in Python
:Duration: 40 min
:Level: Intermediate
:Categories: Data Visualization and Analytics

Summary
=======

This talk presents the techniques we followed to build a scalable, accurate, machine-learning-based classifier for issues entered in natural language. The system is currently in industrial production use.

The training data for the system is the logged historical usage information of the original menu-based system.

The deployed system has given an improvement of more than 25% in accuracy compared to the original process.

The key takeaways of this talk include:

- Typical challenges people face while building a live, production-ready classifier
- Data cleaning approaches using Python
- Building a text classifier, including:

  - Choosing the right algorithm
  - Validating the classifier

- Solution for building the system
- Deployment architecture for handling large scale and a large number of simultaneous requests

Description
===========

The talk is about how to build a large-scale text classifier in Python. It will guide the audience through each of the steps involved in building the classifier. At each stage we will offer practical tips based on our experience, such as the issues we faced, how we worked around them, and possible optimization strategies. By the end, the audience should be able to build a working text classifier in Python.

We have identified 7 sections the talk can cover. With an average of 5 minutes per section, we should be able to cover the topic comfortably in 35 minutes, leaving 5 minutes for questions and answers at the end.

Below is the high-level, section-wise breakdown of the content.

1. Handling unstructured text
-----------------------------

This section covers the techniques we follow for handling unstructured text. It shows how to perform the following operations on unstructured text using Python:

a) Tokenization: splitting text into tokens
b) TF-IDF to extract the most important tokens

   a. Alternate techniques to extract important tokens

c) Compression techniques to reduce the number of features
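As a rough illustration of the operations listed above, here is a minimal standard-library sketch of tokenization and TF-IDF weighting. The function names, the regular expression, the smoothed IDF formula and the sample issues are all assumptions made for this example, not the code used in our system. Keeping only the top-weighted tokens is shown as one crude way to reduce the number of features.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tf_idf(documents):
    """Return one {token: weight} dict per document.

    Weight = term frequency * smoothed inverse document frequency,
    with idf = ln((1 + N) / (1 + df)) + 1 (an assumed, common variant).
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each token once per document
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            tok: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1)
            for tok, count in counts.items()
        })
    return weights


def top_tokens(doc_weights, k=3):
    """Keep the k highest-weighted tokens (ties broken alphabetically)."""
    ranked = sorted(doc_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tok for tok, _ in ranked[:k]]


# Hypothetical issue descriptions, for illustration only.
issues = [
    "Printer not working after the update",
    "Cannot connect to the VPN after the update",
    "Printer driver install failed",
]
weights = tf_idf(issues)
important = [top_tokens(w) for w in weights]
```

Tokens that appear in fewer documents receive a higher weight than tokens that appear in many, which is why common words tend to drop out of the top-ranked list.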

2. Why and How to clean the incident data?
------------------------------------------

The training data for the system is the logged historical usage information of the original menu-based system.

The historical data had many issues, including too many classes, unbalanced data, and unclean data. The provided data contained names, dates, and unneeded words. In this section we show how to clean the data in Python using the following techniques:

1. How to do Named Entity Recognition (NER) to identify different entities so that unneeded entities can be removed

   a. Issues we faced while using NLTK's built-in NER
   b. Alternatives we can use

2. How to do stop word removal
3. Time permitting, additional techniques we used for other data-related issues
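The cleaning steps above can be sketched roughly as follows. This is a simplified, standard-library illustration and not our production code: the stop word list is deliberately tiny, the date pattern covers only a couple of assumed numeric formats, and ``known_names`` is a hypothetical stand-in for the output of a real NER step.

```python
import re

# Small illustrative stop word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "to", "on", "of", "and", "my", "i", "since"}

# Matches numeric dates such as 12/03/2015 or 2015-03-12 (assumed formats).
DATE_PATTERN = re.compile(
    r"\b(?:\d{1,2}[/-]\d{1,2}[/-]\d{2,4}|\d{4}-\d{2}-\d{2})\b"
)


def clean_issue(text, known_names=frozenset()):
    """Drop dates, stop words and known person names from an issue text.

    known_names is a hypothetical set of lowercase names, standing in for
    entities that a Named Entity Recognition step would have identified.
    """
    text = DATE_PATTERN.sub(" ", text.lower())
    tokens = re.findall(r"[a-z0-9]+", text)
    return [t for t in tokens if t not in STOP_WORDS and t not in known_names]


cleaned = clean_issue(
    "John reported the printer is down since 12/03/2015",
    known_names={"john"},
)
```

Removing dates before tokenizing keeps the date fragments from turning into stray numeric tokens.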

3. Building the classifier and Predicting
-----------------------------------------

This section covers the actual task of building the classifier. It walks through the code and steps we followed for:

1. How we split the data into training and test sets
2. How we choose the right algorithm

   a. Issues we faced
   b. Algorithm parameter optimizations

3. Prediction using the chosen algorithm
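To make the steps above concrete, here is a minimal standard-library sketch of a reproducible train/test split followed by training and prediction. Multinomial Naive Bayes with add-one smoothing is used purely as a simple example algorithm; this proposal does not state which algorithm the production system uses. All names and the sample data are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def split_train_test(samples, test_fraction=0.25, seed=0):
    """Shuffle (text, label) pairs reproducibly and split them."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, samples):
        self.class_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocabulary = set()
        for text, label in samples:
            tokens = text.lower().split()
            self.class_counts[label] += 1
            self.token_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, float("-inf")
        for label in sorted(self.class_counts):
            n_tokens = sum(self.token_counts[label].values())
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(self.class_counts[label] / total)
            for tok in tokens:
                count = self.token_counts[label][tok]
                score += math.log((count + 1) / (n_tokens + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labelled issues, for illustration only.
samples = [
    ("printer not printing", "hardware"),
    ("printer paper jam", "hardware"),
    ("printer out of toner", "hardware"),
    ("vpn connection failed", "network"),
    ("cannot connect to vpn", "network"),
    ("wifi network is down", "network"),
]
train, test = split_train_test(samples)
model = NaiveBayesTextClassifier().fit(train)
predictions = [model.predict(text) for text, _ in test]
```

Fixing the shuffle seed keeps the split stable between runs, which makes it easier to compare algorithms and parameter settings on the same test set.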

4. Model Validation and Storing of results
------------------------------------------

Before we freeze the choice of algorithm, we need to validate the results. This section shows how we can validate them:

a) Use of metrics

   a. f1_score
   b. classification_report

b) Evaluation of results
c) Train, test, and re-train loop: if we are not satisfied with the results, how do we use the test results to improve the training data?
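The metric names above match functions in scikit-learn's ``sklearn.metrics`` module. To show what such metrics compute, here is a small standard-library sketch of per-class precision, recall and F1, a macro-averaged F1, and a helper that collects misclassified examples for the re-train loop. The function names and sample labels are assumptions made for this illustration.

```python
def per_class_f1(y_true, y_pred):
    """Return {label: {"precision", "recall", "f1"}} for every label seen."""
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(s["f1"] for s in scores.values()) / len(scores)


def misclassified(texts, y_true, y_pred):
    """Collect (text, expected, predicted) triples to review before re-training."""
    return [(x, t, p) for x, t, p in zip(texts, y_true, y_pred) if t != p]


# Hypothetical test results, for illustration only.
texts = ["printer jam", "printer offline", "vpn down", "wifi down"]
y_true = ["hardware", "hardware", "network", "network"]
y_pred = ["hardware", "network", "network", "network"]
report = per_class_f1(y_true, y_pred)
to_review = misclassified(texts, y_true, y_pred)
```

Reviewing the misclassified examples is one way to decide whether the training data needs relabelling or more examples for a class before the next training round.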

5. Overall Solution
-------------------

This section touches upon the complete solution, where all the previous blocks are put together to show how they can be used in a live system. The blocks include natural language processing, cleaning, model building, prediction, and the learning feedback loop.

6. Building Scalable Solution
-----------------------------

Because this system serves live traffic, it must be designed to handle the stated number of simultaneous live calls while still responding in near real time.

This section presents the overall architecture, showing how a horizontally scalable solution can be built to handle simultaneous loads with near-real-time performance.
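One common way to make a prediction service horizontally scalable is to keep each prediction stateless, so that identical workers can be added behind a load balancer. The sketch below illustrates only that idea, inside a single process, using a thread pool. It is not the deployment architecture of our system, and the keyword lookup is a hypothetical stand-in for a trained model that is loaded once and then only read.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a trained model loaded once per worker.
KEYWORD_TO_CLASS = {"printer": "hardware", "vpn": "network"}


def predict(issue_text):
    """Stateless prediction: depends only on the request and the read-only model."""
    for token in issue_text.lower().split():
        if token in KEYWORD_TO_CLASS:
            return KEYWORD_TO_CLASS[token]
    return "other"


def handle_requests(issue_texts, max_workers=4):
    """Serve many requests concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(predict, issue_texts))


# Hypothetical simultaneous requests, for illustration only.
results = handle_requests([
    "Printer is jammed",
    "VPN keeps dropping",
    "Forgot my password",
])
```

Because ``predict`` shares no mutable state between calls, the same function could run unchanged in several processes or on several machines.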

7. Results and Conclusions
--------------------------

Python can be used for building a large-scale production system. This section touches upon the results we obtained and our conclusions.

