
Page 1: Website Classification using Apache Spark

Website classification using Apache Spark

Amith Nambiar

Page 2: Website Classification using Apache Spark

Demo of the WebCat app

Page 3: Website Classification using Apache Spark

Business problem

Automatically classify new websites into one or more predefined categories.

Page 4: Website Classification using Apache Spark

Why?

Web logs collected from data providers have new websites popping up every day, and these need to be categorised before they are presented to customers in daily reports.

Page 5: Website Classification using Apache Spark

Website classification using Apache Spark's MLlib.

Page 6: Website Classification using Apache Spark

Training Data

The starting point was already-categorised data in the form:

URL, category_id

www.linux.com, 10 -> (Computers and Internet)

www.coles.com.au, 20 -> (Shopping and Classifieds)
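
A minimal sketch of loading these pairs into an RDD (the HDFS path, the comma separator, and the SparkContext setup are assumptions for illustration, not from the deck):

import org.apache.spark.{SparkConf, SparkContext}

// Load the already-categorised (URL, category_id) pairs.
// The path and separator are hypothetical.
val sc = new SparkContext(new SparkConf().setAppName("WebCat"))

val labelledUrls = sc
  .textFile("hdfs:///webcat/training/categorised_urls.csv")
  .map(_.split(","))
  .collect { case Array(url, categoryId) => (url.trim, categoryId.trim.toDouble) }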

Page 7: Website Classification using Apache Spark

Training Data

Developed a crawler to crawl each of the categorised websites.

2,550 websites were picked for the initial training and test data. Each row is transformed as:

URL, category_id -> URL, category_id, features

www.coles.com.au, 20 ->

www.coles.com.au, 20, groceri deliv kitchen bench custom receiv deliveri first spend onlin liquorland cole card cole insur apparel cole credit card locat hour look hervey hervey today normal store hour monday friday 8am special store hour saturday decemb sunday decemb store store search suburb postcod search suburb postcod select locat suburb locat found pleas store store state recip inspir recip tast cole partner tast weekli plan easier visit tast cook month cole magazin everyday ingredi sensat meal famili friend latest cole cole handi video recip creativ kitchen visit cole youtub rang rang product product bakeri dairi fresh fruit cole mobil card heston liquor special diet gluten kosher foodtruck term condit corpor respons corpor respons supplier commit work …

The data above for the website coles.com.au has been crawled, stemmed, and had its stop words removed.
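
A minimal sketch of that preprocessing step (lower-case, tokenise, drop stop words); the stop-word list here is a placeholder, and stemming is assumed to be applied afterwards with a Porter-style stemmer, which is not shown:

import org.apache.spark.rdd.RDD

// Placeholder stop-word list; the actual list used is not given in the deck.
val stopWords = Set("the", "and", "a", "an", "of", "to", "in", "is", "for", "on")

// Lower-case, split on non-letters, drop very short tokens and stop words.
def preprocess(pageText: String): Seq[String] =
  pageText.toLowerCase
    .split("[^a-z]+")
    .filter(_.length > 2)
    .filterNot(stopWords)
    .toSeq

// Applied to the crawled pages: (url, categoryId, rawText) -> (url, categoryId, terms)
def toTerms(crawled: RDD[(String, Double, String)]): RDD[(String, Double, Seq[String])] =
  crawled.map { case (url, cat, text) => (url, cat, preprocess(text)) }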

Page 8: Website Classification using Apache Spark

Bayes's theorem
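
For reference, the theorem in its standard form:

\[
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
\]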

Page 9: Website Classification using Apache Spark

Website classification using Naive Bayes

Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features.
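
Applied here, a website is represented by its feature terms x_1, ..., x_n, and the independence assumption gives the usual decision rule (a standard statement of the model, not taken from the deck):

\[
P(C \mid x_1, \dots, x_n) \;\propto\; P(C) \prod_{i=1}^{n} P(x_i \mid C)
\]

The predicted category is the C that maximises this product.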

Page 10: Website Classification using Apache Spark

tf-idf for weighting

In information retrieval, tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.

https://en.wikipedia.org/wiki/Tf-idf

Page 11: Website Classification using Apache Spark

tf-idf

The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.
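
For reference, a standard formulation (not shown on the slide): for a term t in document d from corpus D,

\[
\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \log \frac{|D|}{|\{d' \in D : t \in d'\}|}
\]

where tf(t, d) is the number of times t appears in d, and the denominator counts the documents containing t. (Implementations such as Spark MLlib's IDF typically smooth the ratio to avoid division by zero.)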

Page 12: Website Classification using Apache Spark

Training data from database/HDFS -> term-document RDDs -> tf-idf's -> array of LabeledPoint(classId, vector)

Calculate tf-idf's on the features.

Create a LabeledPoint for each row of training data. Each row (a website) is turned into the form (classId, sparse vector): 5.0, [100,(1,44,..),(0.3,0.12,…)]

Train the Naive Bayes model: model = NaiveBayes.train(labelPoints)

Predict the class of new data, e.g. “Automotive”: model.predict(feature_vector)
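
A minimal sketch of this pipeline against the RDD-based MLlib API (the hashing dimension, variable names, and the wrapper function are assumptions; NaiveBayes.train and model.predict are the calls named on the slide):

import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// training: (categoryId, terms) pairs produced by the crawl/stem/stop-word step.
def trainAndPredict(training: RDD[(Double, Seq[String])],
                    newSiteTerms: Seq[String]): Double = {

  // Term frequencies as sparse vectors (2^18 hash buckets is an assumption).
  val hashingTF = new HashingTF(1 << 18)
  val tf: RDD[Vector] = hashingTF.transform(training.map(_._2))
  tf.cache()

  // Inverse document frequencies fitted on the training corpus, then applied.
  val idf = new IDF().fit(tf)
  val tfidf: RDD[Vector] = idf.transform(tf)

  // One LabeledPoint(classId, sparseVector) per training row.
  val labelPoints: RDD[LabeledPoint] =
    training.map(_._1).zip(tfidf).map { case (classId, vec) => LabeledPoint(classId, vec) }

  // Train the Naive Bayes model.
  val model = NaiveBayes.train(labelPoints)

  // Predict the category id of a newly crawled website.
  val feature_vector = idf.transform(hashingTF.transform(newSiteTerms))
  model.predict(feature_vector)
}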

Page 13: Website Classification using Apache Spark

API first for Data science

http://engineering.pivotal.io/post/api-first-for-data-science/

Page 14: Website Classification using Apache Spark

High Level Architecture of WebCat

Page 15: Website Classification using Apache Spark

High level architecture of WebCat

Webcat App

Queues/Topics

Link Collector Service

Link Crawler Service

Classification Service

Training Data

Database

Apache Spark

Categorise www.coles.com.au

Category is “Shopping and Classifieds”

The Crawler service can be scaled independently of the rest of the services.

Page 16: Website Classification using Apache Spark

WebCat dashboard on PWS - Pivotal Web Services

Note that the crawler service is scaled up to 6 instances for better performance.

Page 17: Website Classification using Apache Spark

Ideas for improving WebCat?

Page 18: Website Classification using Apache Spark

User feedback loop to update the model on incorrect predictions

Webcat App

Queue with topics

Link Collector Service

Link Crawler Service

Classification Service

Training Data

Database

Apache Spark

Categorise www.bmw.com. We think it is “Electronics” - did we get it right?

No. The category was “Automotive”.

Page 19: Website Classification using Apache Spark

Upload your own data - (website, category) pairs

Webcat App

Queue with topics

Link Collector Service

Link Crawler Service

Classification Service

Training Data

Database

Apache Spark

I know kogan.com.au belongs to category “Shopping and Classifieds” - add it to the training data please.

More data = Better predictions?

Page 20: Website Classification using Apache Spark

User-defined categories, e.g. realestate.com.au -> “Real Estate”

Webcat App

Queue with topics

Link Collector Service

Link Crawler Service

Classification Service

Training Data

Database

Apache Spark

Create new category “Real Estate”

Page 21: Website Classification using Apache Spark

Provide a publicly available API for categorised websites

Webcat App

Queue with topics

Link Collector Service

Link Crawler Service

Classification Service

Training Data

Database

Apache Spark

GET /websites/{id}/category

GET /websites/{id}/features
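
As a rough client-side sketch of these endpoints (the base URL and the website id 42 are assumptions; only the two paths come from the slide):

import scala.io.Source

// Fetch the predicted category and the stored features for a website.
val base = "https://webcat.example.com"   // hypothetical host
val category = Source.fromURL(s"$base/websites/42/category").mkString
val features = Source.fromURL(s"$base/websites/42/features").mkString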

Page 22: Website Classification using Apache Spark

WebCat on Apache MADlib

http://madlib.incubator.apache.org/

Page 23: Website Classification using Apache Spark