Using Density-Based Clustering Approaches to Address Product Classification Based on Kaggle Data

Denis Whelan & Jin Ming

December 3, 2017



Introduction: Product Classification Challenge

▶ Background:
◦ The Otto Group is one of the largest e-commerce companies in the world.
◦ Arranging millions of different products from many countries is a complex task that requires a sophisticated approach.

▶ Data:
◦ ~62,000 samples, 93 numeric features, 9 target labels
◦ Target label names are hidden, but the labels represent key categories such as electronics, fashion, etc.

▶ Kaggle Competition Purpose:
◦ Build a predictive model that can accurately classify products into the 9 appropriate categories (supervised learning)

▶ Our Purpose:
◦ Apply density-based clustering methods to this product classification problem to cluster all products (unsupervised)


Methods: CLARA, DBSCAN, NG-DBSCAN

● CLARA (1990):
  ◦ The basic k-medoids method for large data applications
● DBSCAN (1996):
  ◦ The original density-based method
● NG-DBSCAN (2016):
  ◦ Modified DBSCAN method
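To make the CLARA entry concrete: CLARA is usually described as running PAM (a k-medoids search) on a few random samples and keeping the medoids that score best on the full data. The following is a minimal sketch under that assumption, not the code used for the results in these slides. The function names (`pam`, `clara`, `total_cost`), the parameter defaults, and the toy data are all hypothetical.

```python
import random

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def total_cost(points, medoids):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(euclidean(p, m) for m in medoids) for p in points)

def pam(points, k, rng):
    """Naive PAM: start from random medoids, then swap while the cost drops."""
    medoids = rng.sample(points, k)
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in points:
                if candidate in medoids:
                    continue
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                cost = total_cost(points, trial)
                if cost < best:
                    medoids, best, improved = trial, cost, True
    return medoids

def clara(points, k, n_samples=5, sample_size=40, seed=0):
    """Run PAM on several samples; keep the medoids that are best on ALL points."""
    rng = random.Random(seed)
    best_medoids, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        medoids = pam(sample, k, rng)
        cost = total_cost(points, medoids)  # evaluated on the full data set
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost

# Hypothetical toy data: two well-separated 2-D blobs.
toy_rng = random.Random(1)
toy = [(toy_rng.gauss(0, 0.3), toy_rng.gauss(0, 0.3)) for _ in range(100)] + \
      [(toy_rng.gauss(5, 0.3), toy_rng.gauss(5, 0.3)) for _ in range(100)]
medoids, cost = clara(toy, k=2)
```

Because PAM only ever sees `sample_size` points at a time, the expensive swap search stays small even when the full data set is large, which is the reason CLARA suits large applications.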


CLARA, DBSCAN

● CLARA (Clustering Large Applications, 1990):
  ◦ Runs PAM (Partitioning Around Medoids) on random samples of the data
● DBSCAN (Ester et al., 1996):
  ◦ Uses density-reachable and density-connected points
  ◦ Groups data packed in high-density regions of the feature space
  ◦ Separates 'core points' from 'noise points'
  ◦ Recognizes clusters with arbitrary shapes
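The DBSCAN ideas listed here (core points, density-reachability, noise) can be sketched in a few lines. This is a minimal, brute-force illustration in plain Python, not the implementation behind the reported results. The parameter names `eps` and `min_pts` stand in for the usual Eps and MinPts, and the toy points are hypothetical.

```python
from collections import deque

NOISE = -1

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if euclidean(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)      # None = not visited yet
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE          # may be relabeled later as a border point
            continue
        cluster += 1                   # i is a core point: start a new cluster
        labels[i] = cluster
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:   # j is core, so keep expanding
                queue.extend(j_neighbors)
    return labels

# Hypothetical toy data: two small squares and one isolated point.
points = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.0), (0.5, 0.5),
          (5.0, 5.0), (5.0, 5.5), (5.5, 5.0), (5.5, 5.5),
          (10.0, 0.0)]
labels = dbscan(points, eps=0.8, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Each `region_query` scans every point, so this sketch is quadratic in the number of samples.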


NG-DBSCAN (Lulli et al. 2016)

● Limitations of DBSCAN:
  ◦ Scalability is limited
  ◦ Typically restricted to Euclidean distance rather than arbitrary similarity measures
  ◦ Results are sensitive to the choice of Eps and MinPts
● NG-DBSCAN:
  ◦ An approximate, distributed implementation of DBSCAN
  ◦ More efficient because of the approximation
  ◦ Can represent item dissimilarity through any symmetric distance function


NG-DBSCAN

● Phase 1:
  ◦ create ε-graph
    i. form neighbor graph by connecting each node to k random other nodes
    ii. edges are added to ε-graph if the distance is less than ε
    iii. as soon as a node has M_max neighbors in the ε-graph, remove it from neighbor graph
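The three Phase 1 steps can be sketched on a single machine as follows. This is a simplified illustration, not the distributed algorithm of Lulli et al. The way candidates are found (neighbors and neighbors of neighbors, keeping the k closest for the next round) and the fixed `iterations` count are assumptions added to make the sketch run. The function name and parameters `k`, `eps`, `m_max` are hypothetical stand-ins for k, ε, and M_max.

```python
import random

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def build_eps_graph(points, k, eps, m_max, iterations=10, dist=euclidean, seed=0):
    rng = random.Random(seed)
    n = len(points)
    # (i) neighbor graph: every node starts with k random other nodes
    neighbor = {u: set(rng.sample([v for v in range(n) if v != u], k))
                for u in range(n)}
    eps_graph = {u: set() for u in range(n)}
    active = set(range(n))             # nodes still present in the neighbor graph

    for _ in range(iterations):
        if not active:
            break
        # ASSUMPTION: neighbor-graph edges are treated as undirected in each round
        undirected = {u: set() for u in active}
        for u in active:
            for v in neighbor[u]:
                if v in active:
                    undirected[u].add(v)
                    undirected[v].add(u)
        for u in active:
            # ASSUMPTION (not on the slide): look at neighbors of neighbors too
            candidates = set(undirected[u])
            for v in undirected[u]:
                candidates |= undirected[v]
            candidates.discard(u)
            # (ii) add an edge to the eps-graph when the distance is below eps
            for v in candidates:
                if dist(points[u], points[v]) < eps:
                    eps_graph[u].add(v)
                    eps_graph[v].add(u)
            # keep the k closest candidates as next round's neighbor list
            ranked = sorted(candidates, key=lambda v: dist(points[u], points[v]))
            neighbor[u] = set(ranked[:k])
        # (iii) drop nodes that reached m_max eps-neighbors (checked once per
        # round, so a node's degree can overshoot m_max slightly)
        active = {u for u in active if len(eps_graph[u]) < m_max}
    return eps_graph

# Hypothetical toy data: two tight groups of four points.
pts = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.0), (0.5, 0.5),
       (5.0, 5.0), (5.0, 5.5), (5.5, 5.0), (5.5, 5.5)]
graph = build_eps_graph(pts, k=3, eps=0.8, m_max=10)
```

The result is approximate: with a small `k` or few iterations, some pairs closer than `eps` may never be compared, which is the trade-off that buys the speedup over an exhaustive neighborhood search.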


(Lulli et al. 2016)


NG-DBSCAN

● Phase 2:
  ◦ discovering dense regions
    i. coreness dissemination
    ii. seed identification
    iii. seed propagation
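A single-machine sketch of the three Phase 2 steps, given an ε-graph stored as an adjacency dictionary. It rests on stated assumptions rather than on the exact procedure of Lulli et al.: a node counts as core when it has at least `min_pts` neighbors in the ε-graph, each cluster's seed is the highest-numbered core node in it, and seeds spread by a plain breadth-first search instead of distributed message passing. The function name and the toy graph are hypothetical.

```python
from collections import deque

NOISE = -1

def discover_clusters(eps_graph, min_pts):
    """eps_graph maps non-negative integer node ids to sets of neighbor ids."""
    # i. coreness: ASSUMPTION that core means >= min_pts eps-graph neighbors
    core = {u for u, nbrs in eps_graph.items() if len(nbrs) >= min_pts}
    labels = {u: NOISE for u in eps_graph}

    # ii. seed identification: visit core nodes from the highest id down, so
    #     the first unlabeled core node reached is the seed of a new cluster
    for seed in sorted(core, reverse=True):
        if labels[seed] != NOISE:
            continue
        labels[seed] = seed
        # iii. seed propagation: spread the seed id through the eps-graph
        queue = deque([seed])
        while queue:
            u = queue.popleft()
            for v in eps_graph[u]:
                if labels[v] == NOISE:
                    labels[v] = seed
                    if v in core:
                        queue.append(v)   # only core nodes keep propagating
    return labels

# Hypothetical eps-graph: nodes 0-2 form a dense triangle, 3 hangs off node 2,
# 4 is isolated, and 5-6 are a sparse pair.
toy_graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2},
             4: set(), 5: {6}, 6: {5}}
clusters = discover_clusters(toy_graph, min_pts=2)
print(clusters)  # {0: 2, 1: 2, 2: 2, 3: 2, 4: -1, 5: -1, 6: -1}
```

Node 3 joins cluster 2 as a border node because it touches a core node but is not core itself, while nodes 4, 5, and 6 stay labeled as noise.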


(Lulli et al. 2016)


Results: CLARA

▶ Runtime:
◦ 0.95 seconds


Results: CLARA & DBSCAN


Results: DBSCAN

▶ Runtime:
◦ 18.3 minutes


Questions?