Fast data mining flow prototyping using IPython Notebook 2013/01/31 Jimmy Lai r97922028 [at] ntu.edu.tw

DESCRIPTION

Big data analysis requires fast prototyping of the data mining process to gain insight into the data. In these slides, the author shows how to use IPython Notebook to sketch code for each data mining stage and make quick observations.

Outline

1. Workflow for data mining

2. What IPython Notebook provides

3. Exemplified by text classification

4. Demo code and Notebook usage

Workflow for data mining

• Traditional programming workflow:

– Edit -> Compile -> Run

• Data Mining workflow:

– Execute -> Explore

– Consists of many data processing stages; in each stage we may try several different methods.

– Stages: data parsing, feature extraction, feature selection, model training, model predicting, post processing, etc.
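To make the "Execute -> Explore" idea concrete, here is a minimal sketch of a flow built from interchangeable stages. The function names (`parse`, `extract_words`, `extract_chars`, `run_flow`) and the sample data are hypothetical and do not come from the slides; the point is only that one stage can be swapped while the rest of the flow stays the same.

```python
from typing import Callable, Iterable, List


def parse(raw: Iterable[str]) -> List[str]:
    """Data parsing: strip whitespace and drop empty records."""
    return [line.strip() for line in raw if line.strip()]


def extract_words(docs: List[str]) -> List[List[str]]:
    """Feature extraction, variant A: lowercase word tokens."""
    return [doc.lower().split() for doc in docs]


def extract_chars(docs: List[str]) -> List[List[str]]:
    """Feature extraction, variant B: individual non-space characters."""
    return [[c for c in doc.lower() if not c.isspace()] for doc in docs]


def run_flow(data, stages: List[Callable]):
    """Run the stages in order, feeding each output into the next stage."""
    for stage in stages:
        data = stage(data)
    return data


# Hypothetical raw input; in a notebook each trial could live in its own cell.
raw = ["  Hello world ", "", "Data mining"]
word_features = run_flow(raw, [parse, extract_words])
char_features = run_flow(raw, [parse, extract_chars])
print(word_features)
print(char_features)
```

Only the second stage differs between the two trials, so the results can be compared side by side.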

What IPython Notebook provides

• Interactive Web IDE

– Displays rich data, such as plots by matplotlib and math symbols by LaTeX

– Code cells for sketching

– Execute pieces of code in arbitrary order

– Browser interface for programming remotely

– Easy to demonstrate code and execution results in HTML or PDF

• IPython Notebook makes sketching data analysis easy.

Demo code and Notebook usage

• Demo Code: ipython_demo directory in https://bitbucket.org/noahsark/slideshare

• IPython Notebook:

– Install

$ pip install ipython

– Execution (under the ipython_demo directory)

$ ipython notebook --pylab=inline

– Open the notebook in a browser, e.g. http://127.0.0.1:8888

IPython Notebook Interface

Exemplified by text classification

• Text classification on the 20 newsgroups dataset.

• Dataset:

– Built into sklearn.datasets

– Each article belongs to one of the 20 groups

• Goal: classify each article into one of the newsgroup names.

• Experiment: feature generation using different n-gram parameters.
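The experiment described above might be sketched as follows. This is not the author's demo code: the tiny in-memory corpus is a stand-in so the example runs without downloading anything (the real data can be loaded with `sklearn.datasets.fetch_20newsgroups`), and the choice of `CountVectorizer` with `MultinomialNB` is an assumption, since the slide text does not name the vectorizer or classifier used.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in corpus (assumption for illustration only).
train_docs = [
    "the engine and brakes of my car need repair",
    "new tires improve how the car handles",
    "the rocket reached orbit after launch",
    "the telescope observed a distant galaxy",
]
train_labels = ["rec.autos", "rec.autos", "sci.space", "sci.space"]
test_docs = ["my car engine makes noise", "the launch put a satellite in orbit"]
test_labels = ["rec.autos", "sci.space"]

# Try several n-gram settings and record the accuracy of each trial.
results = {}
for ngram_range in [(1, 1), (1, 2), (1, 3)]:
    model = make_pipeline(
        CountVectorizer(ngram_range=ngram_range),  # feature generation
        MultinomialNB(),                           # classifier (assumed)
    )
    model.fit(train_docs, train_labels)
    results[ngram_range] = model.score(test_docs, test_labels)

for ngram_range, accuracy in results.items():
    print(ngram_range, accuracy)
```

In a notebook, the loop body fits naturally in one cell, so a single setting can be re-run without repeating the data loading.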

Example article

talk.politics.mideast

Sample result of feature extraction
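The slide's output is not reproduced in this text. As a rough illustration of what n-gram features look like, the following pure-Python sketch counts word n-grams in one sentence. The helper `word_ngrams` and the sample sentence are hypothetical; the helper only approximates the effect of an n-gram range setting and is not the library's tokenizer.

```python
from collections import Counter
from typing import List


def word_ngrams(text: str, min_n: int, max_n: int) -> List[str]:
    """Return all word n-grams of length min_n..max_n, in order of size."""
    tokens = text.lower().split()
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


sample = "the quick fox saw the quick dog"
unigrams = Counter(word_ngrams(sample, 1, 1))
uni_and_bi = Counter(word_ngrams(sample, 1, 2))

print(unigrams.most_common(2))
print(uni_and_bi["the quick"])
```

Widening the range from unigrams to unigrams plus bigrams doubles the number of distinct features for this sentence, which is the kind of trade-off the n-gram experiment explores.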

Table of experiment setups

Experiment Result

Observation from plots