Using Density-Based Clustering Approaches to Address Product Classification based on Kaggle Data

Denis Whelan & Jin Ming

December 3, 2017

▶ Background:
◦ The Otto Group is one of the largest e-commerce companies in the world.
◦ Arranging millions of products of many different types, from many countries, is a complex task that requires a sophisticated approach.
▶ Data:
◦ ~62,000 samples, 93 numeric features, 9 target labels
◦ Target labels are hidden but represent key categories such as electronics, fashion, etc.
▶ Kaggle Competition Purpose:
◦ Build a predictive model that can accurately classify products into the 9 appropriate categories (supervised learning)

Introduction: Product Classification Challenge

▶ Our Purpose:
◦ Apply density-based clustering methods to this product classification problem to cluster all products (unsupervised)

● CLARA (1990):
  ● The basic k-medoids method for large data applications
● DBSCAN (1996):
  ● The original density-based method
● NG-DBSCAN (2016):
  ● Modified DBSCAN method

Methods: CLARA, DBSCAN, NG-DBSCAN

● CLARA (Clustering Large Applications, 1990):
  ● Sampling with PAM
● DBSCAN (Ester et al., 1996):
  ● Uses density-reachable points and density-connected points
  ● Groups data packed in high-density regions of the feature space
  ● Separates 'core points' from 'noise points'
  ● Recognizes clusters with arbitrary shapes
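The DBSCAN bullets above can be illustrated with a minimal pure-Python sketch. It is not the implementation used for the results in these slides; the function names `region_query` and `dbscan`, the toy points, and the parameter values are assumptions chosen for illustration, and the brute-force neighbor search would be far too slow for ~62,000 samples.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster += 1  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is core, keep expanding
                queue.extend(j_neighbors)
    return labels

# Toy data (hypothetical): two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

On this toy input the two squares form separate clusters and the isolated point is marked as noise.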

CLARA, DBSCAN
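The "sampling with PAM" idea behind CLARA can be sketched as follows: run a k-medoids search on several small random samples, score each resulting medoid set against the full data set, and keep the best. This is a simplified illustration, not the implementation behind the reported runtime; the names `total_cost`, `pam`, and `clara`, the greedy swap search, and the default values for `n_samples` and `sample_size` are assumptions made for this sketch.

```python
import math
import random

def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam(points, k, rng):
    """Greedy swap-based k-medoids (PAM-style) on a small list of points."""
    medoids = rng.sample(points, k)
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in points:
                if candidate in medoids:
                    continue
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                cost = total_cost(points, trial)
                if cost < best:  # accept only strictly better swaps
                    medoids, best, improved = trial, cost, True
    return medoids

def clara(points, k, n_samples=5, sample_size=40, seed=0):
    """Run PAM on random samples; keep the medoids that fit ALL points best."""
    rng = random.Random(seed)
    best_medoids, best_cost = None, math.inf
    for _ in range(n_samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        medoids = pam(sample, k, rng)
        cost = total_cost(points, medoids)  # evaluated on the full data set
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost

# Toy data (hypothetical): two well-separated 5x5 grids of points.
points = [(x, y) for x in range(5) for y in range(5)]
points += [(100 + x, 100 + y) for x in range(5) for y in range(5)]
medoids, cost = clara(points, k=2)
```

Because PAM only ever sees a sample, its cost is independent of the full data size; only the final scoring pass touches every point, which is what makes the approach practical for large data.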

NG-DBSCAN (Lulli et al. 2016)

● Limitations of DBSCAN
  ● Scalability is limited
  ● Cannot handle arbitrary similarity measures, only uses Euclidean distance
  ● Requires choosing Eps and MinPts
● NG-DBSCAN:
  ● An approximated and distributed implementation of DBSCAN
  ● More efficient because of approximation
  ● Can represent item dissimilarity through any symmetric distance function

NG-DBSCAN

● Phase 1:
  ● Create ε-graph
    i. Form neighbor graph by connecting each node to k random other nodes
    ii. Edges are added to the ε-graph if the distance is less than ε
    iii. As soon as a node has M_max neighbors in the ε-graph, remove it from the neighbor graph

(Lulli et al. 2016)
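The three Phase 1 steps can be sketched in a single-machine form. This is a heavily simplified illustration under stated assumptions, not the distributed algorithm of Lulli et al.: the function name `build_eps_graph`, the neighbors-of-neighbors refinement, the fixed `max_iter` stopping rule, and all default values are choices made for this sketch.

```python
import math
import random

def build_eps_graph(points, eps, k=4, m_max=8, max_iter=10,
                    dist=math.dist, seed=0):
    """Approximate eps-graph as adjacency sets of node indices."""
    rng = random.Random(seed)
    n = len(points)
    eps_graph = {u: set() for u in range(n)}
    # i. neighbor graph: connect each node to k random other nodes
    neighbor = {}
    for u in range(n):
        others = [v for v in range(n) if v != u]
        neighbor[u] = set(rng.sample(others, min(k, len(others))))
    for _ in range(max_iter):
        if not neighbor:
            break
        # Treat neighbor-graph links as undirected for this round.
        linked = {u: set(vs) for u, vs in neighbor.items()}
        for u, vs in neighbor.items():
            for v in vs:
                linked[v].add(u)
        for u in list(neighbor):
            # Candidates: direct links plus neighbors of neighbors.
            candidates = set(linked[u])
            for v in linked[u]:
                candidates |= linked[v]
            candidates.discard(u)
            scored = sorted((dist(points[u], points[v]), v) for v in candidates)
            # ii. pairs closer than eps become edges of the eps-graph
            for d, v in scored:
                if d < eps:
                    eps_graph[u].add(v)
                    eps_graph[v].add(u)
            # Keep the k closest candidates as the refined neighbor list.
            neighbor[u] = {v for _, v in scored[:k]}
        # iii. nodes with at least m_max eps-neighbors leave the neighbor graph
        done = {u for u in neighbor if len(eps_graph[u]) >= m_max}
        for u in done:
            del neighbor[u]
        for u in neighbor:
            neighbor[u] -= done
    return eps_graph

# Toy data (hypothetical): two small groups of nearby points.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (9, 9), (9, 10), (10, 9), (10, 10)]
eps_graph = build_eps_graph(points, eps=1.5)
```

The `dist` argument accepts any symmetric distance function, which is the property the slides highlight. Every edge returned is a true ε-neighbor pair, but some true pairs may be missed; that is the approximation being traded for speed.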

NG-DBSCAN

● Phase 2:
  ● Discovering dense regions
    i. Coreness dissemination
    ii. Seed identification
    iii. Seed propagation

(Lulli et al. 2016)
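A rough single-machine reading of the three Phase 2 steps is sketched below. It assumes a node is "core" when it has at least `min_pts` neighbors in the ε-graph and that each cluster is named after the largest core-node id it contains; both choices, the function name `discover_dense_regions`, and the toy graph are assumptions for illustration rather than details confirmed by the slides.

```python
NOISE = None

def discover_dense_regions(eps_graph, min_pts):
    """Map each node to the seed id of its cluster, or NOISE."""
    # i. coreness dissemination: decide which nodes are core
    core = {u for u, vs in eps_graph.items() if len(vs) >= min_pts}

    # ii. seed identification: core nodes repeatedly adopt the largest
    #     seed id held by themselves or by a core neighbor
    seed = {u: u for u in core}
    changed = True
    while changed:
        changed = False
        for u in core:
            best = max([seed[u]] + [seed[v] for v in eps_graph[u] if v in core])
            if best != seed[u]:
                seed[u] = best
                changed = True

    # iii. seed propagation: non-core nodes next to a core node join its
    #      cluster as border nodes; everything else is noise
    labels = {}
    for u, vs in eps_graph.items():
        if u in core:
            labels[u] = seed[u]
        else:
            core_seeds = [seed[v] for v in vs if v in core]
            labels[u] = max(core_seeds) if core_seeds else NOISE
    return labels

# Toy eps-graph (hypothetical): nodes 0-2 are densely linked, node 3 hangs
# off node 2, nodes 4 and 5 only see each other, node 6 is isolated.
eps_graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2},
             4: {5}, 5: {4}, 6: set()}
labels = discover_dense_regions(eps_graph, min_pts=2)
```

Here nodes 0, 1, and 2 are core and share seed 2, node 3 joins them as a border node, and nodes 4, 5, and 6 remain noise.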

Results: CLARA

▶ Runtime:
◦ 0.95 seconds

Results: CLARA & DBSCAN

Results: DBSCAN

▶ Runtime:
◦ 18.3 minutes

Questions?
