Fast data mining flow prototyping using IPython Notebook

DESCRIPTION

Big data analysis requires fast prototyping of the data mining process to gain insight into the data. In these slides, the author shows how to use IPython Notebook to sketch code for each data mining stage and make quick observations.

2013/01/31

Jimmy Lai

r97922028 [at] ntu.edu.tw

Outline

1. Workflow for data mining

2. What IPython Notebook provides

3. Exemplified by text classification

4. Demo code and Notebook usage

IPython Notebook 2

Workflow for data mining

• Traditional programming workflow:

– Edit -> Compile -> Run

• Data Mining workflow:

– Execute -> Explore

– A flow consists of many data processing stages, and each stage may be tried with several different methods.

– Stages: data parsing, feature extraction, feature selection, model training, model prediction, post-processing, etc.
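
The stages listed above can be sketched as one small function per stage, so that each can be re-run and swapped independently. This is only a minimal illustration using the Python standard library: the toy corpus, the function names, the document-frequency threshold, and the centroid-style scorer are all assumptions made for this sketch, not the code from the slides.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: (label, raw text) pairs invented for this sketch.
RAW = [
    ("sports", "the team won the game"),
    ("sports", "a great game for the home team"),
    ("space", "the rocket reached orbit"),
    ("space", "orbit insertion of the new rocket"),
]


def parse(raw):
    """Data parsing: split records into labels and token lists."""
    labels = [label for label, _ in raw]
    docs = [text.lower().split() for _, text in raw]
    return labels, docs


def extract_features(docs):
    """Feature extraction: bag-of-words term counts per document."""
    return [Counter(tokens) for tokens in docs]


def select_features(features, min_doc_freq=2):
    """Feature selection: keep terms seen in at least min_doc_freq documents."""
    doc_freq = Counter()
    for counts in features:
        doc_freq.update(counts.keys())
    return {term for term, freq in doc_freq.items() if freq >= min_doc_freq}


def train(labels, features, vocab):
    """Model training: sum the selected term counts per class."""
    centroids = defaultdict(Counter)
    for label, counts in zip(labels, features):
        for term, count in counts.items():
            if term in vocab:
                centroids[label][term] += count
    return dict(centroids)


def predict(model, tokens, vocab):
    """Model prediction: pick the class whose term counts overlap most."""
    counts = Counter(t for t in tokens if t in vocab)

    def score(label):
        return sum(counts[t] * model[label][t] for t in counts)

    return max(sorted(model), key=score)


# Each line below corresponds to one stage and could live in its own cell.
labels, docs = parse(RAW)
features = extract_features(docs)
vocab = select_features(features)
model = train(labels, features, vocab)
print(predict(model, "the rocket is in orbit".split(), vocab))
```

Keeping each stage in its own function (or notebook cell) means a different method can be tried at one stage without re-running the earlier ones.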

What IPython Notebook provides

• Interactive Web IDE

– Displays rich output, such as matplotlib plots and math symbols rendered with LaTeX

– Code cells for sketching

– Executes pieces of code in arbitrary order

– Browser interface for programming remotely

– Easy to present code and execution results as HTML or PDF

• IPython Notebook makes it easy to sketch data analysis.
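
As a rough illustration of the rich-output point, the sketch below draws a matplotlib figure whose labels use LaTeX-style math text. In a notebook with inline plotting the figure would appear under the cell; here it is rendered to PNG bytes so the same code also runs outside a notebook. The function names and the plotted curve are assumptions chosen for this sketch.

```python
import io

import matplotlib

matplotlib.use("Agg")  # headless backend, so no display is required
import matplotlib.pyplot as plt


def plot_curve(xs, ys):
    """Draw a simple line plot with math-text labels and return the figure."""
    fig, ax = plt.subplots()
    ax.plot(xs, ys, marker="o")
    ax.set_xlabel(r"$n$")
    ax.set_ylabel(r"$n^2$")
    ax.set_title(r"Sketch: $y = n^2$")
    return fig


def figure_to_png_bytes(fig):
    """Render a figure to PNG bytes, similar to what a notebook embeds."""
    buffer = io.BytesIO()
    fig.savefig(buffer, format="png")
    plt.close(fig)
    return buffer.getvalue()


xs = list(range(10))
png = figure_to_png_bytes(plot_curve(xs, [x * x for x in xs]))
print(len(png) > 0)
```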

Demo code and Notebook usage

• Demo Code: ipython_demo directory in https://bitbucket.org/noahsark/slideshare

• IPython Notebook:

– Install

$ pip install ipython

– Execution (under ipython_demo dir)

$ ipython notebook --pylab=inline

– Open the notebook in a browser, e.g. http://127.0.0.1:8888

IPython Notebook Interface

Exemplified by text classification

• Text classification on the 20 newsgroups dataset.

• Dataset:

– Built into sklearn.datasets

– Each article belongs to one of 20 newsgroups

• Goal: classify each article into one of the newsgroups by name.

• Experiment: feature generation using different n-gram parameters.
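
The experiment described above could be sketched with scikit-learn roughly as follows. The slides do not name the vectorizer or classifier in this text, so `CountVectorizer` with `MultinomialNB` is an assumption, as are the helper names `load_newsgroups` and `compare_ngram_ranges` and the particular n-gram ranges. `fetch_20newsgroups` downloads the dataset on first use, so it is wrapped in a function and not called automatically.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def load_newsgroups():
    """Fetch the train/test splits (downloads the data on first call)."""
    from sklearn.datasets import fetch_20newsgroups

    train = fetch_20newsgroups(subset="train")
    test = fetch_20newsgroups(subset="test")
    return train.data, train.target, test.data, test.target


def compare_ngram_ranges(train_x, train_y, test_x, test_y,
                         ngram_ranges=((1, 1), (1, 2), (1, 3))):
    """Train one model per n-gram range and return accuracy per range."""
    results = {}
    for ngram_range in ngram_ranges:
        # Assumed setup: raw n-gram counts fed to a naive Bayes classifier.
        model = make_pipeline(
            CountVectorizer(ngram_range=ngram_range),
            MultinomialNB(),
        )
        model.fit(train_x, train_y)
        results[ngram_range] = accuracy_score(test_y, model.predict(test_x))
    return results


# In a notebook cell, the full experiment would be started with:
# results = compare_ngram_ranges(*load_newsgroups())
```

Returning a dictionary keyed by n-gram range keeps the results easy to tabulate or plot in a later cell.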

Example article

talk.politics.mideast

Sample result of feature extraction
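
A minimal sketch of what word n-gram feature extraction produces, using a made-up sentence instead of a newsgroup article. The function name `word_ngrams` and its parameters are assumptions for illustration, and whitespace tokenization is a simplification of what a real vectorizer does.

```python
from collections import Counter


def word_ngrams(text, n_min=1, n_max=2):
    """Count all word n-grams with n between n_min and n_max (inclusive)."""
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return Counter(grams)


# Hypothetical input sentence, chosen so that one bigram repeats.
features = word_ngrams("the quick fox saw the quick dog")
for gram, count in sorted(features.items()):
    print(gram, count)
```

Raising `n_max` adds longer phrases as features, which is the parameter the experiment varies.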

Table of experiment setups

Experiment Result

Observation from plots
