Oscon miller 2011

The scheduled Redis speaker was sick, so I whipped this up in about an hour and filled in with a talk on a different subject. It's a bit crude, but you get a big-picture view of how to build a simple AI application using BigCouch. The accompanying video is up at http://www.youtube.com/watch?v=QEBDNxbSRuk

Mike Miller, _milleratmit, July 25, 2011

Bayes on your (Big)Couch

Mike Miller, Oscon 2011 2

I want my app to do _this_

CouchDB in a slide

• Schema-free document database management system
  - Documents are JSON objects
  - Able to store binary attachments

• RESTful API
  - http://wiki.apache.org/couchdb/reference

• Views: custom, persistent representations of your data
  - Incremental MapReduce with results persisted to disk
  - Fast querying by primary key (views stored in a B-tree)

• Bi-directional replication
  - Master-slave and multi-master topologies supported
  - Optional ‘filters’ to replicate a subset of the data
  - Edge devices (mobile phones, sensors, etc.)

BigCouch = Couch + Scaling

• Open source, Apache License

• Horizontal scalability
  - Easily add storage capacity by adding more servers
  - Computing power (views, compaction, etc.) scales with more servers

• No SPOF
  - Any node can handle any request
  - Individual nodes can come and go

• Transparent to the application
  - All clustering operations take place “behind the curtain”
  - Looks (mostly) like a single-server instance of CouchDB

...back to making my app smart

Sample Data

[Figure: "Height vs. Weight" scatter plot for girls and boys. x-axis: Weight [lbs], 80 to 220; y-axis: Height [in], 35 to 80.]

Naive Bayes Classifier

[Figure: Gaussian probability density, x-axis from -3 to 3, y-axis from 0 to 0.4. Labels on the slide: height, male, mean male height, male height variance.]
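The labels above annotate the per-field likelihood of a Gaussian naive Bayes classifier. A sketch of the standard form, assuming each numerical field is normally distributed within a class; the symbols h, x_i, c, \mu and \sigma^2 are notation introduced here, not taken from the slide:

```latex
% Likelihood of observing height h given the class "male",
% using the mean and variance of male heights from the training data.
P(h \mid \text{male}) =
  \frac{1}{\sqrt{2\pi\,\sigma_{\text{male}}^{2}}}
  \exp\!\left( -\frac{(h - \mu_{\text{male}})^{2}}{2\,\sigma_{\text{male}}^{2}} \right)

% Naive Bayes: fields x_1, ..., x_n are treated as independent given the class c.
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

The class with the largest posterior is the prediction, so training only needs a mean and a variance per class and field.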

Implementation Plan

Model people as documents in CouchDB

Calculate Means/Variances with MapReduce

Run the classifier inside CouchDB as a post-MapReduce hook (“_list”)

• Note:
  - No need to specify which fields to use in classification
  - Multi-class implementation
  - Continuous, incremental training! Results improve as training data trickles in.

3 ways to follow along

• couchapp: Python tool to push/pull from other CouchDBs
  > sudo easy_install -U couchapp
  > couchapp clone 'http://millertime.cloudant.com/bitb'
  Create an account at cloudant.com
  > curl -X PUT 'http://<username>:<pwd>@<username>.cloudant.com/bitb'
  > couchapp push 'http://<username>:<pwd>@<username>.cloudant.com/bitb'

• GitHub
  > git clone git@github.com:mlmiller/bayes.git

• CouchDB replication to your Cloudant account
  - Bonus: brings along the data, too!

The Code

Classifier (Probability Calculator)

view code to calculate means and variances

post-MapReduce hook (“_list” method)

you can ignore everything else

client side test via node.js

Data Model

A ‘class’ field marks a document as training data

Arbitrary number of numerical fields allowed
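As an illustration of this data model, a training document and an unclassified document might look like the following. The field names height and weight match the arguments of the test script; the class labels, the _id values, the numbers, and the two helper functions are assumptions made for this sketch.

```javascript
// Hypothetical documents; all values are made up for illustration.
var trainingDoc = { _id: "person-001", class: "girl", height: 62, weight: 110 };
var unlabeledDoc = { _id: "person-002", height: 70, weight: 175 };

// A document counts as training data when it carries a non-empty 'class' field.
function isTrainingDoc(doc) {
  return typeof doc.class === "string" && doc.class.length > 0;
}

// List the numerical fields of a document, skipping CouchDB metadata
// such as _id and _rev. No field names are hard-coded.
function numericFields(doc) {
  return Object.keys(doc).filter(function (key) {
    return key.charAt(0) !== "_" && typeof doc[key] === "number";
  });
}

console.log(isTrainingDoc(trainingDoc), numericFields(trainingDoc));
console.log(isTrainingDoc(unlabeledDoc), numericFields(unlabeledDoc));
```

Because the fields are discovered at run time, adding another numerical field to the documents requires no change to this code.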

Training via MapReduce

A ‘class’ field marks a document as training data

Calculate mean/variance for all numerical fields in a document

emit: ([<class>, <field>], <value>)

Reduce: _stats (Erlang builtin)

views/training/map.js
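A minimal sketch of what such a map function could look like; the actual views/training/map.js may differ. Inside CouchDB, emit is a global supplied by the view server. Here it is passed in as a parameter, an assumption made only so that the sketch runs standalone under node.js. The function name trainingMap and the sample document are also illustrative.

```javascript
// Sketch of a training map function. CouchDB provides `emit` as a global;
// it is a parameter here only so the function can run outside CouchDB.
function trainingMap(doc, emit) {
  if (!doc.class) return; // only documents with a 'class' field are training data
  Object.keys(doc).forEach(function (field) {
    var value = doc[field];
    // Skip CouchDB metadata (_id, _rev) and anything that is not a finite number.
    if (field.charAt(0) === "_" || typeof value !== "number" || !isFinite(value)) {
      return;
    }
    emit([doc.class, field], value); // key: [<class>, <field>], value: <value>
  });
}

// Standalone usage: collect the emitted rows in an array.
var rows = [];
trainingMap({ _id: "a", class: "boy", height: 70, weight: 160 }, function (key, value) {
  rows.push({ key: key, value: value });
});
console.log(JSON.stringify(rows));
```

In the design document, the reduce for this view would simply be the string "_stats", and querying with group=true returns one statistics object per [<class>, <field>] key.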

Bayes: Trained State

pre-reduce output

Bayes: Trained State

Count, Min, Max, Mean, Variance

Automatically updated as new training data arrives
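CouchDB's built-in _stats reduce returns sum, count, min, max and sumsqr for each key, so the mean and variance follow from those with a little arithmetic. A sketch of that step; the function name and the sample numbers are illustrative, not taken from the talk's code:

```javascript
// Derive the mean and (population) variance from a _stats reduce value.
function meanAndVariance(stats) {
  var mean = stats.sum / stats.count;
  // Var[x] = E[x^2] - (E[x])^2; clamp at 0 to guard against round-off.
  var variance = Math.max(0, stats.sumsqr / stats.count - mean * mean);
  return { mean: mean, variance: variance };
}

// Example: the _stats value for the three made-up heights 60, 62 and 64.
var example = { sum: 186, count: 3, min: 60, max: 64, sumsqr: 11540 };
console.log(meanAndVariance(example));
```

Since sum, count and sumsqr are all additive, the view can update them incrementally as documents arrive, which is what makes continuous training cheap.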

Bayes Classifier

Load state from DB

No assumptions on Field Names

Calculate prob. for all possible hypotheses

lib/bayes_classifier.js
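A minimal sketch of a Gaussian naive Bayes classifier along these lines. It is not the contents of lib/bayes_classifier.js: the shape of the state object (class, then field, then mean/variance/count), the function names, and the example numbers are all assumptions. Priors are approximated from the per-field counts.

```javascript
// Log of the Gaussian density; working in logs avoids numeric underflow.
function logGaussian(x, mean, variance) {
  return -0.5 * Math.log(2 * Math.PI * variance) -
         Math.pow(x - mean, 2) / (2 * variance);
}

// state:  { <class>: { <field>: { mean, variance, count } } }
// sample: { <field>: <number> }
// Returns a normalized probability for every class (hypothesis) in the state.
function classify(state, sample) {
  var classes = Object.keys(state);
  var sizes = {};
  var total = 0;
  // Approximate each class's training size by its largest per-field count.
  classes.forEach(function (c) {
    var counts = Object.keys(state[c]).map(function (f) { return state[c][f].count; });
    sizes[c] = Math.max.apply(null, counts);
    total += sizes[c];
  });
  var logScore = {};
  classes.forEach(function (c) {
    var lp = Math.log(sizes[c] / total); // log prior
    // Field names come from the sample itself; none are hard-coded.
    Object.keys(sample).forEach(function (f) {
      var s = state[c][f];
      // Simplification: fields without usable statistics for a class are skipped.
      if (s && s.variance > 0) lp += logGaussian(sample[f], s.mean, s.variance);
    });
    logScore[c] = lp;
  });
  // Normalize with the log-sum-exp trick.
  var maxLog = Math.max.apply(null, classes.map(function (c) { return logScore[c]; }));
  var norm = 0;
  classes.forEach(function (c) { norm += Math.exp(logScore[c] - maxLog); });
  var probs = {};
  classes.forEach(function (c) { probs[c] = Math.exp(logScore[c] - maxLog) / norm; });
  return probs;
}

// Made-up trained state, for illustration only.
var state = {
  boy:  { height: { mean: 70, variance: 9, count: 500 },
          weight: { mean: 170, variance: 400, count: 500 } },
  girl: { height: { mean: 64, variance: 9, count: 500 },
          weight: { mean: 130, variance: 400, count: 500 } }
};
var probs = classify(state, { height: 71, weight: 175 });
console.log(probs);
```

Returning a probability per class, rather than only the winner, lets the caller apply its own confidence threshold.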

A brief aside...

• Let's test our classifier
  - Select 2000 documents for the test
  - Randomly choose 1000 documents as the training sample
  - Remaining documents used for validation

• Simulate continuous training
  - Add documents one at a time
  - After each document addition, test on all 1000 documents of our validation sample
  - Record and plot the fraction of the validation sample properly classified
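The procedure above can be sketched as a small harness. This is an illustration under stated assumptions, not the script used for the talk: the function names are hypothetical, the model is supplied by the caller as two callbacks (train and predict), and the usage example plugs in a trivial stand-in model rather than the Bayes classifier.

```javascript
// Fisher-Yates shuffle on a copy; `random` is injectable for reproducibility.
function shuffled(items, random) {
  var a = items.slice();
  for (var i = a.length - 1; i > 0; i--) {
    var j = Math.floor(random() * (i + 1));
    var tmp = a[i]; a[i] = a[j]; a[j] = tmp;
  }
  return a;
}

// Randomly pick `nTrain` documents for training; the rest are for validation.
function split(docs, nTrain, random) {
  var mixed = shuffled(docs, random || Math.random);
  return { training: mixed.slice(0, nTrain), validation: mixed.slice(nTrain) };
}

// Add training documents one at a time. After each addition, record the
// fraction of validation documents whose predicted class equals doc.class.
// Assumes a non-empty validation sample.
function learningCurve(training, validation, train, predict) {
  return training.map(function (doc) {
    train(doc);
    var correct = validation.filter(function (v) {
      return predict(v) === v.class;
    }).length;
    return correct / validation.length;
  });
}

// Usage with a stand-in model that always predicts the last class it was shown.
var last = null;
var parts = split(
  [{ class: "a" }, { class: "a" }, { class: "b" }, { class: "a" }], 2, Math.random);
var curve = learningCurve(
  parts.training, parts.validation,
  function (doc) { last = doc.class; },
  function () { return last; });
console.log(curve); // one accuracy value per training document added
```

With 2000 documents and nTrain set to 1000, the returned array is the curve to plot against the number of training documents.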

A brief aside...

[Figure: fraction of the validation sample properly classified vs. number of documents in the training set]

Dramatic improvement with additional training data

... and back to the code

test it yourself

• Client-side test via node.js
  > ./test.js height=<some number> weight=<some number>
  - Classifier runs server side, configured in line 6 of test.js
  - Can point this to your DB
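When pointing a client at your own database, it helps to know that a CouchDB list function is addressed as /<db>/_design/<ddoc>/_list/<list>/<view>. The sketch below only builds such a URL and makes no request. The list name 'classify', the host, and passing the sample as query parameters are assumptions, not taken from test.js; the design document name 'bayes' and the view name 'training' follow the paths shown in the slides.

```javascript
// Build the URL of a (hypothetical) 'classify' list function applied to the
// 'training' view of the 'bayes' design document.
function classifierUrl(base, db, sample) {
  var url = new URL(base);
  url.pathname = "/" + db + "/_design/bayes/_list/classify/training";
  // group=true asks the view for one reduced row per [<class>, <field>] key.
  url.searchParams.set("group", "true");
  Object.keys(sample).forEach(function (field) {
    url.searchParams.set(field, String(sample[field]));
  });
  return url.toString();
}

// Host and database name are placeholders.
console.log(classifierUrl("http://localhost:5984", "bitb", { height: 70, weight: 160 }));
```

Swapping the base URL and database name is all it takes to aim the same request at a different account or a local BigCouch node.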

Running as CouchApp

http://millertime.cloudant.com/bitb/_design/bayes/index.html

• Create a database (e.g., ‘bitb’) at cloudant.com
• Add data
• Then push your code
  > couchapp push 'http://<user>:<pwd>@<user>.cloudant.com/bitb'
• HTML & CSS served directly from BigCouch to the browser
• Heavy lifting of classification done server side

Wrapping Up: Bayes on BigCouch

• Simple code, powerful results
  - Light requirements on the data model
  - Can be relaxed with more complex view code

• Continuous learning is very powerful
  - e.g., time-based learning (automatically adapt to changing conditions)

• Classification can be performed client- or server-side
  - Push documents into the DB and they are auto-tagged!

• More sophisticated classifiers easily implemented
  - e.g., Cloudant Search pre-calculates and exposes TF-IDF scores for textual classification, weighted classifiers, etc.

• View Engine allows simple deployment of sophisticated domain libraries in a massively parallel fashion
  - e.g., Lucene, R, SciPy, NumPy, Matlab/Octave, etc.

Give it a spin

Hosting, Management, Support for CouchDB and BigCouch
http://cloudant.com

http://github.com/cloudant/bigcouch