Using Density-Based Clustering Approaches to Address Product Classification based on Kaggle Data

Denis Whelan & Jin Ming

December 3, 2017

▶ Background:
◦ The Otto Group is one of the largest e-commerce companies in the world.
◦ Arranging millions of products from a variety of different products and countries is a complex task that requires a sophisticated approach.

▶ Data:
◦ ~62,000 samples, 93 numeric features, 9 target labels
◦ Target labels are hidden but represent key categories such as electronics, fashion, etc.

▶ Kaggle Competition Purpose:
◦ Build a predictive model that can accurately classify products into the 9 appropriate categories (supervised learning)

Introduction: Product Classification Challenge

▶ Our Purpose:
◦ Apply density-based clustering methods to this product classification problem to cluster all products (unsupervised)

● CLARA (1990):
  ● The basic k-medoids method for large data applications

● DBSCAN (1996):
  ● The original density-based method

● NG-DBSCAN (2016):
  ● Modified DBSCAN method

Methods: CLARA, DBSCAN, NG-DBSCAN

● CLARA (Clustering Large Applications, 1990):
  ● Sampling with PAM

● DBSCAN (Ester et al., 1996):
  ● The use of density-reachable points and density-connected points
  ● Groups data packed in high-density regions of the feature space
  ● Separates 'core points' from 'noise points'
  ● Recognizes clusters with arbitrary shapes
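The DBSCAN bullets above can be made concrete with a minimal pure-Python sketch. It is not the code used for the reported results. The names `region_query` and `dbscan`, the toy points, and the values `eps=0.8` and `min_pts=3` are illustrative assumptions, and the brute-force neighbor search is quadratic in the number of points.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks noise points."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE            # may become a border point later
            continue
        cluster += 1                     # i is a core point: start a new cluster
        labels[i] = cluster
        frontier = list(neighbors)
        while frontier:
            j = frontier.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:   # j is also core, keep expanding
                frontier.extend(j_neighbors)
    return labels

# Two tight squares of points and one isolated point (toy data).
points = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.0), (0.5, 0.5),
          (5.0, 5.0), (5.0, 5.5), (5.5, 5.0), (5.5, 5.5),
          (10.0, 0.0)]
labels = dbscan(points, eps=0.8, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Each square becomes one cluster because its points are density-connected through core points, while the isolated point has too few neighbors and stays labeled as noise.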

CLARA, DBSCAN
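For the CLARA half of this slide, "sampling with PAM" can be sketched as follows: draw a few small random samples, run PAM (k-medoids) on each sample only, and keep the medoids that give the lowest cost on the full data set. This is a simplified illustration, not the implementation behind the reported runtime. The function names, the greedy swap loop, and the defaults (`n_samples=5`, `sample_size=20`) are assumptions.

```python
import math
import random

def total_cost(points, medoids):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam(sample, k, rng):
    """Basic PAM on a small sample: accept medoid swaps until none lowers the cost."""
    medoids = rng.sample(sample, k)
    best = total_cost(sample, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in sample:
                if candidate in medoids:
                    continue
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                cost = total_cost(sample, trial)
                if cost < best:
                    medoids, best, improved = trial, cost, True
    return medoids

def clara(points, k, n_samples=5, sample_size=20, seed=0):
    """Run PAM on several random samples; keep the medoids that fit all points best."""
    rng = random.Random(seed)
    best_medoids, best_cost = None, math.inf
    for _ in range(n_samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        medoids = pam(sample, k, rng)
        cost = total_cost(points, medoids)   # evaluated on the full data set
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    labels = [min(range(k), key=lambda c: math.dist(p, best_medoids[c]))
              for p in points]
    return best_medoids, labels

# Toy data: two well-separated blobs of 30 points each.
data_rng = random.Random(1)
blob_a = [(data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)) for _ in range(30)]
blob_b = [(data_rng.uniform(99, 101), data_rng.uniform(99, 101)) for _ in range(30)]
points = blob_a + blob_b
medoids, labels = clara(points, k=2)
```

Because PAM only ever sees `sample_size` points at a time, the expensive swap search stays small even when the full data set is large, which is the reason CLARA scales to large applications.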

NG-DBSCAN (Lulli et al. 2016)

● Limitations of DBSCAN
  ● Scalability is limited
  ● Cannot handle arbitrary similarity measures, only uses Euclidean distance
  ● Results depend on the choice of Eps and MinPts

● NG-DBSCAN:
  ● An approximated and distributed implementation of DBSCAN
  ● More efficient because of approximation
  ● Can represent item dissimilarity through any symmetric distance function
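To illustrate what "any symmetric distance function" allows, here is a small sketch of an ε-neighborhood query that takes the distance as a parameter, used with Jaccard distance on sets. The helper names and the example item sets are hypothetical and are not taken from the Otto data or from NG-DBSCAN's code.

```python
def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; symmetric in a and b, 0.0 for identical sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def eps_neighbors(items, i, eps, distance):
    """Indices of items closer than eps to items[i] under the given distance."""
    return [j for j, other in enumerate(items)
            if j != i and distance(items[i], other) < eps]

# Hypothetical items described by sets of tags rather than numeric vectors.
items = [{"shirt", "jeans"}, {"shirt", "jeans", "belt"}, {"laptop", "mouse"}]
neighbors = eps_neighbors(items, 0, eps=0.5, distance=jaccard_distance)
print(neighbors)  # [1]
```

Nothing in `eps_neighbors` assumes coordinates, so the same query works for text, sets, or any other items for which a symmetric dissimilarity can be computed.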

NG-DBSCAN

● Phase 1:
  ● Create ε-graph
    i. Form neighbor graph by connecting each node to k random other nodes
    ii. Edges are added to ε-graph if the distance is less than ε
    iii. As soon as a node has M_max neighbors in the ε-graph, remove it from neighbor graph

(Lulli et al. 2016)
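The three Phase 1 steps can be sketched sequentially in a few lines. This is a loose single-machine approximation under stated assumptions, not the distributed algorithm of Lulli et al. It uses a fixed number of iterations instead of the paper's termination conditions, and it checks the M_max condition once per iteration rather than "as soon as" it holds. The name `build_eps_graph` and the parameter defaults are illustrative.

```python
import math
import random

def build_eps_graph(points, eps, k=3, m_max=5, iterations=5, seed=0):
    """Approximate eps-graph: node -> set of nodes found at distance < eps."""
    rng = random.Random(seed)
    n = len(points)

    def d(a, b):
        return math.dist(points[a], points[b])

    # i. Neighbor graph: connect each node to k random other nodes.
    neighbor = {u: set(rng.sample([v for v in range(n) if v != u], k))
                for u in range(n)}
    eps_graph = {u: set() for u in range(n)}

    for _ in range(iterations):
        # Treat links as undirected so each node also sees who points at it.
        for u in list(neighbor):
            for v in list(neighbor[u]):
                neighbor[v].add(u)
        updated = {}
        for u in neighbor:
            # Candidates: direct neighbors plus neighbors of neighbors (2 hops).
            candidates = set(neighbor[u])
            for v in neighbor[u]:
                candidates |= neighbor[v]
            candidates.discard(u)
            # ii. Add an edge to the eps-graph if the distance is less than eps.
            for c in candidates:
                if d(u, c) < eps:
                    eps_graph[u].add(c)
                    eps_graph[c].add(u)
            # Keep only the k closest candidates for the next iteration.
            updated[u] = set(sorted(candidates, key=lambda c: d(u, c))[:k])
        # iii. Nodes with at least m_max eps-neighbors leave the neighbor graph.
        done = {u for u in updated if len(eps_graph[u]) >= m_max}
        neighbor = {u: nbrs - done for u, nbrs in updated.items() if u not in done}
        if not neighbor:
            break
    return eps_graph

points = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.0), (0.5, 0.5),
          (5.0, 5.0), (5.0, 5.5), (5.5, 5.0), (5.5, 5.5),
          (10.0, 0.0)]
eps_graph = build_eps_graph(points, eps=0.8)
```

Every edge in the result is a true ε-neighbor pair, but because only a few candidates are examined per node, some true pairs may be missed. That trade-off is what the slides mean by an approximated implementation.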

NG-DBSCAN

● Phase 2:
  ● Discovering dense regions
    i. Coreness dissemination
    ii. Seed identification
    iii. Seed propagation

(Lulli et al. 2016)
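A sequential sketch of Phase 2, given an ε-graph as a dictionary from node to neighbor set. It makes several assumptions: a node is treated as core when it has at least `min_pts` ε-neighbors, the seed of a cluster is taken to be its largest core node id, and seed identification and propagation are collapsed into one simple label-propagation loop. The distributed algorithm in the paper is more involved; the function name and the toy graph are illustrative.

```python
NOISE = None

def discover_dense_regions(eps_graph, min_pts):
    """Label each node with the seed id of its cluster, or NOISE."""
    # i. Coreness: a node is core if it has at least min_pts eps-neighbors.
    core = {u for u, nbrs in eps_graph.items() if len(nbrs) >= min_pts}

    # ii./iii. Seeds: each core node repeatedly adopts the largest seed id
    # among itself and its core neighbors, until no label changes.
    seed = {u: u for u in core}
    changed = True
    while changed:
        changed = False
        for u in core:
            best = max([seed[u]] + [seed[v] for v in eps_graph[u] if v in core])
            if best != seed[u]:
                seed[u] = best
                changed = True

    # Border nodes take the seed of a neighboring core node; the rest are noise.
    labels = {}
    for u, nbrs in eps_graph.items():
        if u in core:
            labels[u] = seed[u]
        else:
            core_nbrs = [v for v in nbrs if v in core]
            labels[u] = seed[min(core_nbrs)] if core_nbrs else NOISE
    return labels

# Toy eps-graph: two dense groups, one border node (4), one isolated node (9).
eps_graph = {
    0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4}, 4: {3},
    5: {6, 7, 8}, 6: {5, 7, 8}, 7: {5, 6, 8}, 8: {5, 6, 7},
    9: set(),
}
labels = discover_dense_regions(eps_graph, min_pts=3)
print(labels)
# {0: 3, 1: 3, 2: 3, 3: 3, 4: 3, 5: 8, 6: 8, 7: 8, 8: 8, 9: None}
```

Node 4 has too few neighbors to be core but touches core node 3, so it joins that cluster as a border node, while node 9 remains noise.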

Results: CLARA

▶ Runtime:
◦ 0.95 seconds

Results: CLARA & DBSCAN

Results: DBSCAN

▶ Runtime:
◦ 18.3 minutes

Questions?
