2016 IEEE International Conference on Big Data (Big Data)
Large-Scale Taxonomy Categorization for Noisy Product Listings
Pradipto Das, Yandi Xia, Aaron Levine, Giuseppe Di Fabbrizio, and Ankur Datta
Rakuten Institute of Technology, Boston
Rakuten USA
{pradipto.das,ts-yandi.xia,aaron.levine,giuseppe.difabbrizio,ankur.datta}@rakuten.com
Abstract—E-commerce catalogs include a continuously growing number of products that are constantly updated. Each item in a catalog is characterized by several attributes and identified by a taxonomy label. Categorizing products with their taxonomy labels is fundamental to effectively search and organize listings in a catalog. However, manual and/or rule-based approaches to categorization are not scalable. In this paper, we compare several classifiers for product taxonomy categorization of top-level categories. We first investigate a number of feature sets and observe that a combination of word unigrams from product names and navigational breadcrumbs works best for categorization. Secondly, we apply correspondence topic models to detect noisy data and introduce a lightweight manual process to improve dataset quality. Finally, we evaluate linear models, gradient boosted trees (GBTs) and convolutional neural networks (CNNs) with pre-trained word embeddings, demonstrating that, compared to other baselines, GBTs and CNNs yield the highest gains in error reduction.
I. INTRODUCTION
Product taxonomy categorization is a key factor in exposing
products of merchants to potential online buyers. Most catalog
search engines use taxonomy labels to optimize query results
and match relevant listings to users’ preferences [1]. To illus-
trate the general concept, consider Fig. 1. Nordstrom pushes
new shoe listings to an online catalog infrastructure, which
then organizes the listings into a taxonomy tree. When a user
searches for the comfort shoes “Lyza Clog”, the search engine
first has to understand that the user is searching for a “comfort shoes” category. Then, if the specific shoe product cannot
be found in the inventory, relevant shoes in the “Comfort” category are shown in the search results to encourage the
user to browse further. In addition to improving the quality
of product search, good categorization also plays a critical
role in targeted advertising, personalized recommendations
and product clustering. However, there are multiple challenges
in achieving good product categorization.
Commercial product taxonomies are organized in tree struc-
tures, three to ten levels deep, with thousands of leaf nodes
[2, 3, 4, 5]. The breadth of such taxonomies leads to unavoidable human errors while entering data. Even a site with a unified taxonomy where merchants directly upload listings
reported a 15% error rate in categorization [3]. For example,
in Fig. 1, a merchant has selected a wrong label, “Sneakers”,
for “1883 by Wolverine Women’s Maisie Oxford Tan/Taupe
Leather/Suede”, which contributes to labeling noise in training
data. Further, most e-commerce companies receive millions of
new listings per month from hundreds of merchants composed
of wildly different formats, descriptions, prices and meta-data
for the same products. For instance, even an identical product
can be listed as “Wilson tennis racket Level III signed by Federer” for $80 by one merchant and “Wilson tennis racquet Level 3 Roger Federer TM” for $72 by another. A robust
classifier has to assign the same taxonomy label to such listings
(e.g., “Sports & Outdoors > Fitness > Tennis > Rackets”)
regardless of these issues.
Current e-commerce systems trade off between accuracies
obtained by classifying a listing directly into thousands of leaf
node categories [2, 4] and splitting the taxonomy at predefined
depths [6, 3] in favor of building smaller subtree models. For
the latter scenario, there is usually another trade-off between
the number of hierarchical subtrees, the runtime latency intro-
duced by multiple categorization over the cascading subtrees
and the snowballing of error propagated in the prediction
chain. In this paper, similarly to [3], we classify product
listings in two steps: first we predict the top-level category and
then classify the listings using the subtree models selected by
the top level predictions. We use a publicly available Amazon
dataset [5] and two in-house datasets1 for our large-scale
taxonomy categorization experiments on product listings.
This paper makes several contributions to address such
large, noisy product datasets: 1) We perform large scale com-
parisons with several robust classification methods and show
that Gradient Boosted Trees (GBTs) [7, 8] and Convolutional
Neural Networks (CNNs) [9, 10] perform substantially better
than state-of-the-art linear models using word unigram features
(Section V). We further provide analysis of their performance
w.r.t. imbalance in product datasets. 2) We demonstrate that
using both listing price and navigational breadcrumbs – the
branches that merchants assign to the listings in web pages
for navigational purposes – boost categorization performance
(Section V-C) and 3) We effectively apply correspondence
topic models to detect and remove mislabeled instances in
training data with minimal human intervention (Section V-D).
1 The in-house datasets are from Rakuten USA Inc., managed by Rakuten Ichiba, Japan’s largest e-commerce company.
Fig. 1: E-commerce platform using taxonomy categorization to understand query intent, match merchant listings to potential
buyers as well as to prevent buyers from navigating away on search misses.
II. RELATED WORK
Our problem domain is similar to the one explored in
[2, 6, 3, 11, 4, 12], but with more pronounced data quality
issues. The work in [2] emphasizes the use of simple classifiers
in combination with large-scale manual efforts to reduce noise
and imperfections from categorization outputs. While human
intervention is important, we show how unsupervised topic
models can substantially reduce such expensive efforts for
product listings crawled in the wild. Further, [2] sheds no
light on the contribution of each classifier to the categorization
problem. In response, we adopted stronger baseline systems
based on regularized linear models [13, 14, 15].
A recent work from [4] emphasizes the use of recurrent neu-
ral networks for taxonomy categorization purposes. Although
they mention that RNNs render unlabeled pre-training of word
vectors [16] unnecessary, their embedding only depends on a
recursive representation of the same initial feature space. In
contrast, we show that training word embeddings on the whole
set of three product title corpora improves performance for
CNN models and opens up the possibility of leveraging other
product corpora when available. However, unlike the recent
claims in [2] and [17], both [4] and our experiments show the ineffectiveness of the Naïve Bayes classifiers.
Finally, [3] advocate the use of algorithmic splitting of
the taxonomy structure using graph theoretic latent group
discovery to mitigate data imbalance problems at the leaf
nodes. They use a combination of k-NN classifiers at the
coarser level and SVMs [18] classifiers at the leaf levels. Their
SVMs solve much easier k-way multi-class categorization
problems where k ∈ {3, 4, 5} with much less data imbalance.
We, however, have found that SVMs do not work well in
scenarios where k is large and the data is imbalanced. Due to
our high dimensional feature spaces, we avoided k-NN classi-
fiers that can cause prohibitively long prediction times under
arbitrary feature transformations [19]. Further, we make use
of probabilistic latent topics to eliminate mislabeled listings
in a novel way that would otherwise render the latent group
discovery in [3] ineffective on our in-house datasets.
III. DATASET CHARACTERISTICS
We use two in-house datasets, named BU1 and BU2, and
one publicly available Amazon dataset (AMZ) [5] for the
experiments in this paper. BU1 is categorized using human
annotation efforts and rule-based automated systems. This
leads to a high precision training set at the expense of
coverage. On the other hand, for BU2, noisy taxonomy labels
from external data vendors have been automatically mapped
to an in-house taxonomy without any human error correction,
resulting in a larger dataset at the cost of precision. BU2 also
suffers from inconsistencies in product titles and meta-data that
are incomplete or malformed due to errors in the web crawlers
that vendors use to aggregate new listings. In our experiments
on the BU2 dataset, the noise is distributed identically in the
training and test sets, thus evaluation of the classifiers is not
impeded by it.
Datasets Subtrees Branches Listings PCC KL
BU1 16 1,146 12.1M 0.643 0.872
BU2 15 571 60M 0.209 0.715
AMZ 25 18,188 7.46M 0.269 1.654
TABLE I: Dataset properties: total number of top-level category subtrees, branches and listings, along with the PCC and mean KL divergence imbalance measures described in the text.
The BU2 dataset has the noisiest ground-truth labels where
product listings have been assigned to incorrect labels. How-
ever, as the manual verification of millions of listings is
infeasible, using some proxy for ground truth is a viable
alternative that has previously produced encouraging results
[3]. To mitigate manual corrective efforts, we resort to using
topic models to detect incorrect listings (Section V-D). Topic
models such as LDA [20] have the advantage of finding latent
structure in data without the use of labels. In addition to
noisy ground truth labels, product taxonomies often exhibit
imbalanced data distribution.
Fig. 2: Imbalance in evaluation datasets for (a) BU1, (b) AMZ (vertical scale of the top panel is 10x that of Figs. 2a and 2c), and (c) BU2. The top row shows the number of branches with a non-zero number of listings. The bottom row shows the KL divergence of the empirical distribution of listings in each subtree compared to the uniform distribution.
Fig. 3: Top-level category distribution of 40 million deduplicated listings from the Dec 2015 snapshot of BU2. Each category subtree is also imbalanced, as seen in the exploded view of the “Furniture” category.
Figure 3 shows the top-level “Home, Furniture and Patio” subtree that accounts for almost half of the BU2 dataset. The
first row in Fig. 2 shows the number of branches for top-
level taxonomy subtrees for three datasets. To measure the
level of data imbalance, we calculated the Pearson correlation
coefficient (PCC) between the number of listings and branches
in each of the top-level subtrees (Table I). A perfectly balanced
tree will have a PCC = 1.0. BU1 (Fig. 2a) shows the most
benign kind of imbalance with a PCC = 0.643. This confirms
that the number of branches in the subtrees correlates well
with the volume of listings. AMZ shows the highest average
number of branches in the subtrees. This is likely due to the
crawling technique used to harvest the data, which we believe
differs from Amazon’s internal catalog. For AMZ and BU2
in particular, the number of branches in the subtrees does not
correlate well with the volume of listings, indicating a much
higher level of imbalance.
The bottom row in Fig. 2 shows the average Kullback-
Leibler (KL) divergence, KL(p(x) ‖ q(x)) [21], between the
empirical distribution over listings in branches for each sub-
tree, p(x), compared to a uniform distribution, q(x). Here, the
KL divergence acts as a measure of imbalance of the listing
distribution and is indicative of the categorization performance
that one may obtain on a dataset; high KL divergence leads
to poorer categorization and vice-versa (see Section V).
IV. GRADIENT BOOSTED TREES AND CONVOLUTIONAL
NEURAL NETWORKS
We observed in our experiments that the strongest performing models were GBTs [7] and CNNs [9, 10]. GBTs optimize a loss functional $\mathcal{L} = E_y[L(y, F(\mathbf{x})) \mid \mathbf{X}]$, where $F(\mathbf{x})$ can be a mathematically difficult to characterize function, such as a decision tree $f(\mathbf{x})$ on $\mathbf{X}$. The optimal value of the function is expressed as
$$F^{*}(\mathbf{x}) = \sum_{m=0}^{M} f_m(\mathbf{x}, \mathbf{a}_m) \quad (1)$$
where $f_0(\mathbf{x}, \mathbf{a}_0)$ is the initial guess and $\{f_m(\mathbf{x}, \mathbf{a}_m)\}_{m=1}^{M}$ are incremental “boosts” on $\mathbf{x}$ defined by the optimization method, with $\mathbf{a}_m$ as the parameters of $f_m$. To compute $F^{*}(\mathbf{x})$, we also need to calculate
$$\mathbf{a}_m = \arg\min_{\mathbf{a}} \sum_{i=1}^{N} L(y_i, F_m(\mathbf{x}_i)) \quad (2)$$
with
$$F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + f_m(\mathbf{x}, \mathbf{a}_m). \quad (3)$$
This expression looks very similar to
$$F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \rho_m \left( -\left[ \frac{\partial L(y, F(\mathbf{x}))}{\partial F(\mathbf{x})} \right]_{F(\mathbf{x}) = F_{m-1}(\mathbf{x})} \right) \quad (4)$$
where $\rho_m$ is the step length and
$$\left[ \frac{\partial L(y, F(\mathbf{x}))}{\partial F(\mathbf{x})} \right]_{F(\mathbf{x}_i) = F_{m-1}(\mathbf{x}_i)} = g_m(\mathbf{x}_i) \quad (5)$$
is the search direction. To solve for $\mathbf{a}_m$, we make the basis functions $f_m(\mathbf{x}_i; \mathbf{a})$ correlate most with
$$-\left[ \frac{\partial L(y_i, F(\mathbf{x}_i))}{\partial F(\mathbf{x}_i)} \right]_{F(\mathbf{x}_i) = F_{m-1}(\mathbf{x}_i)} \quad (6)$$
where the gradients are constrained to be defined over the training data distribution. It can be shown that
$$\mathbf{a}_m = \arg\min_{\mathbf{a}} \sum_{i=1}^{N} \left( -g_m(\mathbf{x}_i) - \rho f_m(\mathbf{x}_i, \mathbf{a}) \right)^2 \quad (7)$$
and
$$\rho_m = \arg\min_{\rho} \sum_{i=1}^{N} L\big(y_i, F_{m-1}(\mathbf{x}_i) + \rho f_m(\mathbf{x}_i; \mathbf{a}_m)\big) \quad (8)$$
which yields
$$F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \rho_m f_m(\mathbf{x}, \mathbf{a}_m). \quad (9)$$
Additional details are described in [7].
In GBTs, the optimal selection of decision variables, value splits, and means at leaf nodes (i.e., $\mathbf{a}_m$) is based on optimizing $f_m(\mathbf{x}, \mathbf{a})$ using a logistic loss. Each decision tree in a GBT is robust to imbalanced data and outliers [13], and $F(\mathbf{x})$ can approximate arbitrarily complex decision boundaries in a piecewise manner.
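As a point of reference, a rough sketch of such a boosted-tree setup with the open-source XGBoost implementation cited as [8] is shown below; the toy titles, labels, and hyper-parameter values are illustrative assumptions, not the paper's exact configuration (the paper reports trees up to depth 500):

```python
from sklearn.feature_extraction.text import CountVectorizer
from xgboost import XGBClassifier

# Toy listings and top-level category ids (placeholders).
titles = ["wilson tennis racquet level 3 roger federer",
          "1883 by wolverine women s maisie oxford tan leather",
          "wilson tennis racket level iii signed by federer",
          "stretch criss cross strap sandals"]
labels = [0, 1, 0, 1]

vectorizer = CountVectorizer()             # word unigram counts (cf. Section V-B)
X = vectorizer.fit_transform(titles)

gbt = XGBClassifier(max_depth=6,           # assumed; far shallower than the paper's trees
                    n_estimators=200,
                    learning_rate=0.1)
gbt.fit(X, labels)
print(gbt.predict(vectorizer.transform(["roger federer tennis racquet"])))
```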
The neural network we use is based on the CNN architecture described in [10] using the TensorFlow framework [22]. As in [10], we enhance the performance of “vanilla” CNNs (Fig. 4 right) using word embedding vectors [16] trained on the product titles from all datasets, without taxonomy labels. Context windows of width n, corresponding to n-grams embedded in a 300-dimensional word embedding space, are convolved with L filters, followed by rectified linear unit activation and a max-pooling operation over the set of all windows W. This operation results in an L × 1 vector, which is then connected to a softmax output layer of dimension K × 1, where K is the number of classes. The filters are initialized with Xavier initialization [23]. We use mini-batch stochastic gradient descent with the Adam optimizer [24] to perform parameter optimization.
The CNN model tries to allocate as few filters as possible to the context windows while balancing the constraints on the back-propagation of error residuals w.r.t. the cross-entropy loss $L = -\sum_{k=1}^{K} q_k \log p_k$, where $p_k$ is the probability of a product title x belonging to class k predicted by our model, and $q \in \{0, 1\}^K$ is a one-hot vector that represents the true label of title x. This results in a higher predictive power for the CNNs, while still matching complex decision boundaries in a smoother fashion than GBTs.
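A rough tf.keras sketch of this architecture follows; it is not the authors' original TensorFlow code, and the vocabulary size, maximum title length, and filter count are assumptions:

```python
import tensorflow as tf

MAX_LEN, VOCAB, EMB_DIM, NUM_FILTERS, K = 30, 50000, 300, 128, 15  # assumed sizes

tokens = tf.keras.Input(shape=(MAX_LEN,), dtype="int32")
emb = tf.keras.layers.Embedding(VOCAB, EMB_DIM)(tokens)  # can be seeded with word2vec vectors [16]
pooled = []
for width in (1, 3, 4, 5):                                # context window widths from Section V
    conv = tf.keras.layers.Conv1D(NUM_FILTERS, width, activation="relu")(emb)
    pooled.append(tf.keras.layers.GlobalMaxPooling1D()(conv))       # max-pool over all windows W
features = tf.keras.layers.Concatenate()(pooled)          # concatenated filter responses (the L x 1 vector)
outputs = tf.keras.layers.Dense(K, activation="softmax")(features)  # K x 1 softmax output layer

model = tf.keras.Model(tokens, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(),       # Adam [24]
              loss="sparse_categorical_crossentropy")     # the cross-entropy loss L above
```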
V. EXPERIMENTAL SETUP AND RESULTS
We use Naïve Bayes (NB) [25], similar to the approach
described in [17] and [2], and Logistic Regression (LogReg)
classifiers with L1 [26] and Elastic Net regularization, as
robust baselines.
The objective functions of both GBTs and CNNs involve L2
regularizers over the set of parameters. Our development set
for parameter tuning is generated by randomly selecting 10%
of the datasets for listings under the “apparel / clothing” cat-
egories. The optimized parameters obtained from this scaled-down configuration are extended to all other classifiers as well
to reduce experimentation time for more time-consuming clas-
sification models such as GBTs and CNNs. After parameter
tuning, we set a linear combination of 15% L1 regularization
and 85% L2 regularization for Elastic Net. For GBTs, we
limit each decision tree growth to a maximum depth of 500.
For CNNs, we use context window widths of sizes 1, 3, 4, 5 for
four convolution filters, a batch size of 1024 and an embedding
dimension of 300. The parameters for the embeddings are non-
static. LogReg classifiers and CNN need data to be normalized
along each dimension, which is not needed for NB and GBT.
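For reference, a minimal scikit-learn construction of the Elastic Net logistic-regression baseline with the 15% L1 / 85% L2 mix mentioned above might look as follows; the paper's experiments used LIBLINEAR/LibShortText-style tooling, so the solver and regularization strength here are assumptions:

```python
from sklearn.linear_model import LogisticRegression

elastic_net_logreg = LogisticRegression(penalty="elasticnet",
                                        solver="saga",    # solver that supports elastic net
                                        l1_ratio=0.15,    # 15% L1, 85% L2
                                        C=1.0,            # assumed regularization strength
                                        max_iter=1000)
```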
A. Data Preprocessing
BU1 is exclusively comprised of product titles, hence, our
features are primarily extracted from these titles. For AMZ
and BU2, we additionally extract the list price whenever
available. For BU2, we also use the leaf node of any available
navigational breadcrumbs. In order to decrease training and
categorization run times, we employ a number of vocabulary
filtering methods. Further, English stop-words and rare tokens
that appear in ten listings or fewer are then filtered out. This
reduces vocabulary sizes by up to fifty percent, without a
significant reduction in categorization performance. For CNNs,
we replace numbers with the nominal form [NUM] and remove rare tokens. We also remove punctuation and lowercase all text. Parts of speech (POS) tagging using a generic tagger
from [27] trained on English text produced very noisy features,
as is expected for out-of-domain tagging. Consequently, we do
not use these features due to the absence of a POS training
set for listings as in [28]. We also experimented with word
embedding features from Word2Vec model [16], for instance,
to constrain variants of words like “t-shirts” in one embedding
dimension, however, the overall results have not been better.
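The normalization steps above can be summarized in a small sketch; the stop-word list and rare-token set below are placeholders (the actual filter drops tokens seen in ten or fewer listings):

```python
import re

STOPWORDS = {"the", "a", "an", "and", "for", "of", "with"}   # stand-in for a full English list

def preprocess_title(title, rare_tokens=frozenset(), for_cnn=False):
    """Lowercase, strip punctuation, optionally map numbers to [NUM] (CNN input only),
    then drop stop words and rare tokens, roughly following Section V-A."""
    text = re.sub(r"[^\w\s]", " ", title.lower())             # remove punctuation, lowercase
    tokens = text.split()
    if for_cnn:
        tokens = ["[NUM]" if tok.isdigit() else tok for tok in tokens]
    return [tok for tok in tokens if tok not in STOPWORDS and tok not in rare_tokens]

print(preprocess_title("Wilson tennis racquet Level 3, Roger Federer TM", for_cnn=True))
```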
B. Initial Experiments on BU1 dataset
Our initial experiments use unigram counts and three other
indicators – word bigram counts, bi-positional unigram counts,
and bi-positional bigram counts. Consider a title text “120 gb hdd 5400rpm sata fdb 2 5 mobile” from the “Data storage” leaf node of the Electronics taxonomy subtree and another title
Fig. 4: Classifier performance on BU1 test set. The CNN classifier has only one configuration and thus shows constant curves
in all plots. Left figure shows prediction on 10% test set using word unigram count features; middle figure shows prediction
on 10% test set using word bigram bi-positional count features; and the right figure shows mean micro-precision over different
feature setups except CNNs. In all figures, “OvO” means “One vs. One” and “OvA” means “One vs All”.
text “acer aspire v7 582pg 6421 touchscreen ultrabook 15 6 full hd intel i5 4200u 8gb ram 120 gb hdd ssd nvidia geforce gt 720m” from the “Laptops and netbooks” leaf node. In such
cases, we observe that merchants tend to place terms pertaining
to storage device specifics in the front of product titles for
“Data storage” and similar terms towards the end in the titles
for “Laptops”. As such, we split the length of the title in half
and augment word uni/bigrams with a left/right-half position.
This makes sense from a Naïve Bayes point of view, since signals like “120 gb”[Left Half], “gb hdd”[Left Half], “120 gb”[Right Half] and “gb hdd”[Right Half] decorrelate the feature space better, which suits the naïve assumption in NB. This also helps to explain the class posteriors slightly better. These assumptions for NB are validated in Fig. 4 left, Fig. 4 middle and Fig. 4 right. Word unigram count features perform
strongly for all classifiers except NB, whereas bi-positional
word bigram features helped only NB significantly.
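To make the bi-positional features concrete, here is a small sketch of the tagging scheme (the exact feature string format is our own choice):

```python
def bipositional_features(tokens, bigrams=True):
    """Tag word uni/bi-grams with the half of the title in which they occur,
    e.g. "120 gb"[Left Half] for a data-storage title vs. [Right Half] for a laptop title."""
    half = len(tokens) / 2.0
    feats = []
    for i, tok in enumerate(tokens):
        side = "Left Half" if i < half else "Right Half"
        feats.append(f"{tok}[{side}]")
        if bigrams and i + 1 < len(tokens):
            feats.append(f"{tok} {tokens[i + 1]}[{side}]")
    return feats

print(bipositional_features("120 gb hdd 5400rpm sata fdb 2 5 mobile".split()))
```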
Additionally, the micro-precision and F1 scores for CNNs
and GBTs are significantly higher compared to other algo-
rithms on word unigrams using a paired t-test with a p-value < 0.0001. The performances of GBTs and LogReg L1 classifiers
deteriorate over the other feature sets as well. The bi-positional
and bigram feature sets also do not produce any improvements
for the AMZ dataset. Based on these initial experiments, we
focus on word unigrams in all of our subsequent experiments.
C. Categorization Improvements with Navigational Bread-crumbs and List Prices on BU2 Dataset
BU2 is a challenging dataset in terms of class imbalance and
noise and we sought to improve categorization performance
using available meta-data. To start, we experiment with a
smaller dataset consisting of ≈ 500,000 deduplicated listings
under the “Women’s Clothing” taxonomy subtree, extracted
from our Dec 2015 snapshot of 40 million records. We then
test against ≈ 2.85 million deduplicated “Women’s Clothing” listings from the Feb 2016 BU2 dataset. In all experiments,
10% of the data is used as test set.
The first noteworthy fact in Fig. 5 is that the micro-precision
and F1 of the GBTs substantially improve after increasing the
size of the dataset. Further, stop words and rare words filtering
decrease precision and F1 by less than 1%, despite halving the
feature space. The addition of navigational leaf nodes and list
prices prove advantageous, with both features independently
boosting performance and raising micro-precision and F1 to
over 90%. Despite finding similar gains in categorization
performance for other top-level subtrees by using these meta
features, we needed a system to filter mis-categorized listings
from our training data as well.
Fig. 5: Improvements in micro-precision and F1 for GBTs on
BU2 dataset for “Women’s Clothing” subtree
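One straightforward way to realize these meta-data features is to append the breadcrumb leaf text and a log-scaled list price to the title unigram representation; the sketch below reflects this reading and is not necessarily the paper's exact encoding:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

def title_plus_metadata(titles, breadcrumb_leaves, prices):
    """Stack title unigram counts, breadcrumb-leaf unigram counts, and a
    log-scaled list-price column into one sparse feature matrix."""
    vec = CountVectorizer()
    X_title = vec.fit_transform(titles)
    X_crumb = vec.transform(breadcrumb_leaves)   # assumption: breadcrumbs share the title vocabulary
    X_price = csr_matrix(np.log1p(np.asarray(prices, dtype=float)).reshape(-1, 1))
    return hstack([X_title, X_crumb, X_price]).tocsr()
```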
3890
Fig. 6: Top few most probable words under the latent “noise” topics over listings in the “Shoes” subtree. Human annotators inspect whether such sets of words belong to a Shoes topic or not.
Fig. 7: Interpretation of latent topics using predictions from a GBT classifier. The topics here exclude those in Fig. 6 and are all from the Feb 2016 snapshot of the BU2 dataset.
Fig. 8: Correspondence MMLDA model for text [29].
We also validated the fact that list prices and the leaf node
text from breadcrumbs improve classification performance by
observing the highest weighted features obtained from the
GBT model. For instance, of the top thousand features in the top-level classification model, seven percent are breadcrumbs. The list price feature is also ranked around the three hundredth place out of a total of around five hundred thousand features. Qualitative inspection
of the top one hundred features from the top-level model also
revealed an abundance of terms from the “Jewelry & Watches”
category. This observation also correlated well with the fact
that the test set performance for the top-level “Jewelry &
Watches” category has also been the highest, with a yield
of 98% precision and F1.
D. Noise Analysis of BU2 Dataset using Correspondence LDAfor Text
When categorization of listings in the “Shoes” subtree
performed over 25 points below “Women’s Clothing”, as seen
in Fig. 11, we theorized that the incorrect label assignments
are to blame. There are over 3.4 million “Shoes” listings in the BU2 dataset, and thus a manual scan is infeasible, unlike in [2].
To address this problem, we compute p(x) over latent topics
zk, and automatically annotate the most probable words over
each topic.
We choose the CorrMMLDA model (Fig. 8) from [29]
to discover the latent topical structure of the listings in the
Shoes category because of two reasons. Firstly, the model is
a natural choice for our scenario – it is intuitive to assume
that store and brand names (e.g. the store “QVC” and brand “Adam Tucker” – words w_{d,m} in the M plate of listing d in Fig. 8) are distributions over words in titles (e.g. the words in “Stretch criss cross strap sandals” – words w_{d,n} in the N plate of the same listing d in Fig. 8). The title words are in turn distributions over the latent topics z_d for listing d ∈ {1..D}. Secondly, the CorrMMLDA model has been
shown to exhibit lower held out perplexities indicative of
improved topic quality. The reason behind the lower perplexity
stems from the following observations:
Using the notation in [29], we denote the free parameters of the variational distributions over words in brand and store names, say λ_{d,m,n}, as Multinomials over words in the titles, and those over words in the title, say φ_{d,n,k}, as Multinomials over latent topics z_d. It is easy to see that the posterior over the topic z_{d,k} for each word w_{d,m} of brand and store names is dependent on λ and φ through $\sum_{n=1}^{N_d} \lambda_{d,m,n} \times \phi_{d,n,k}$. This means that if a certain topic z_d = j generates all words in the title, i.e., φ_{d,n,j} > 0, then only that topic also generates the brand and store names, thereby increasing likelihood of fit and reducing perplexity. The other topics z_d ≠ j do not contribute towards explaining the topical structure of the listing d. To this end, the correspondence view, w^N_d, is made up of word unigrams in the title, and the content view, w^M_d, consists of brand and merchant names, both as bags-of-words.
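As a toy numerical check of this statement (with random rather than fitted λ and φ values), the topic posterior for one brand/store word can be computed as:

```python
import numpy as np

rng = np.random.default_rng(0)
N_d, K = 6, 4                              # assumed title length and number of topics
phi = rng.dirichlet(np.ones(K), size=N_d)  # phi[n, k]: topic Multinomial of title word n
lam = rng.dirichlet(np.ones(N_d))          # lambda[m, n]: alignment of one brand/store word to title words

posterior = lam @ phi                      # sum_n lambda[m, n] * phi[n, k]
posterior /= posterior.sum()               # posterior over topics z_{d,k} for that word
print(posterior)
```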
We train the CorrMMLDA model with K = 100 latent
topics. A sample of nine latent topics and their most probable
words shown in Fig. 6 demonstrates topics outside of the
“Shoes” domain can be manually identified, while reducing
human annotation efforts from 3.4 million records to one
hundred. We choose K = 100 since it is roughly twice
the number of branches for the Shoes subtree. This choice
provides modeling flexibility while respecting the number of
ground truth classes.
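CorrMMLDA is not available in off-the-shelf libraries, so purely to illustrate the workflow (fit a topic model on title text, then read off the most probable words per topic for annotation), a plain LDA [20] stand-in with gensim could look like this:

```python
from gensim import corpora, models

# Tokenized "Shoes" titles (toy placeholders; in practice, millions of listings).
titles_tokens = [["stretch", "criss", "cross", "strap", "sandals"],
                 ["leather", "oxford", "women", "tan", "suede"]]

dictionary = corpora.Dictionary(titles_tokens)
corpus = [dictionary.doc2bow(toks) for toks in titles_tokens]

# K = 100 in the paper; a tiny value is used here only so the toy example runs.
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=5)
top_words = [[w for w, _ in lda.show_topic(k, topn=6)] for k in range(lda.num_topics)]
print(top_words)
```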
We next run a list of the most probable six words, the
average length of a “Shoes” listing’s title, from each latent
topic through another GBT classifier trained on the full, noisy
data for the “Shoes” category. The GBT classifier used for this
Fig. 9: Micro-precision on 10% test set from the
BU2 dataset across top-level categories.
Fig. 10: Micro-precision on 10% test set from the AMZ dataset
across top-level categories.
Fig. 11: Micro-precision and F1 across fifteen top-level
categories on 10% (4 million listings) of Dec 2015 BU2
snapshot.
purpose has been trained only on the words from the product
titles but without the list prices and the leaf node text from
the breadcrumbs. The reason is that the short descriptions of
the latent topics, which we want to classify, resemble short
bag-of-word queries without any metadata.
Categories that mismatch their topics are manually labeled
ambiguous, as seen in the bottom two rows in Fig. 7. As a
final confirmation, we uniformly sampled a hundred listings
from each topic judged to be ambiguous. Inspections revealed that numerous listings from merchants that do not sell shoes are cataloged in the “Shoes” subtree due to vendor error. To
this end, we remove listings corresponding to such “out-of-category” merchants from all top-level categories.
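The resulting screening loop can be sketched as follows; titles_only_gbt, vectorizer, and top_words are hypothetical names for the titles-only classifier, its feature extractor, and the per-topic word lists from above:

```python
# Interpret each latent topic by classifying its top-6-word pseudo-query with the
# titles-only GBT (cf. Fig. 7); annotators then mark mismatching topics as ambiguous
# and inspect a ~100-listing sample from each before removing offending merchants.
topic_interpretations = []
for k, words in enumerate(top_words):
    query = " ".join(words)                                   # short bag-of-words "query"
    pred = titles_only_gbt.predict(vectorizer.transform([query]))[0]
    topic_interpretations.append((k, words, pred))            # handed to human annotators
```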
To summarize, by manually inspecting the K × 6 most probable words from the K = 100 topics and J × 100 listings, where J ≪ K, instead of 3.4 million listings, a few annotators
accomplished in hours what would have taken hundreds of
annotators several months according to the estimates in [2].
E. Results on BU2 and AMZ Datasets
Here we first report categorization performance of our
LogReg L1 baseline on the earlier Dec 2015 snapshot of BU2
dataset using just the words in the titles excepting stop words.
In section V-B, we have shown the efficacy of word unigram
features on the BU1 dataset. Figure 11 shows that LogReg
with L1 regularization [11, 30] initially achieves 83% mean
micro-precision and F1 on the initial BU2 dataset. This had
fallen short of our expectation of achieving an overall 90%
precision (red line in Fig. 11), but forms a robust baseline for
our subsequent experiments with the AMZ and the cleaned
BU2 datasets. We additionally use list price and leaf nodes of
navigational breadcrumbs for BU2 dataset and list price for
AMZ dataset when available.
Overall, Naïve Bayes, being an overly simplified generative
model, generalizes very poorly on all datasets (see Figs. 4,
9 and 10). It may be possible to improve NB’s performance
using sub-sampling techniques [31]; however, sub-sampling
can have its own problems for product datasets [2].
We observe from Table II that when log(N/B), where N is the number of listings and B is the number of categories, is relatively high (11.54 for BU2 and 9.27 for BU1, compared to 6.02 for AMZ), most classifiers tend to perform well. Figures
with a ∗ are statistically better than other non-starred ones in
the same row except the last column. From Fig. 9, it is clear
that GBTs are better on BU2.
We also experiment with CNNs augmented to use meta-
data while respecting the convolutional constraints on title
text, however, the performance improved only marginally. It
is not immediately clear why all the classifiers suffer on the
“CDs and Vinyl” category, which has more than 500 branches
– see Fig. 10. The AMZ dataset also suffers from novel
cases of data imbalance. For instance, most of the listings
in “Books” and “Grocery” are in one branch, with most other
branches containing less than 10 listings. This might be due
Dataset NB LogReg ElasticNet LogReg L1 GBT CNN w/ pre-training Mean KL
BU1 81.45 86.30 86.75 89.03* 89.12* 0.872
BU2 68.21 84.29 85.01 90.63* 88.67 0.715
AMZ 49.01 69.39 66.65 67.17 72.66* 1.654
TABLE II: Mean micro-precision on 10% test set from BU1, BU2 and AMZ datasets
to the use of a presumed “navigational taxonomy” in the
crawled AMZ dataset [5]. However, we do not have Amazon’s
internal product taxonomy organization to validate the claim.
In summary, from both Figs. 9 and 10, we observe that GBTs
and CNNs with pre-training perform best even in extreme data
imbalance. It is possible that GBTs need finer parameter tuning
per top-level subtree for datasets resembling AMZ.
F. Hardware Setup and Empirical Runtime Analysis
Finally, we briefly touch upon the hardware setup and
the empirical runtime analysis of the three best classifiers.
We report training and prediction runtimes for the largest
dataset: BU2. For the logistic regression with L1 regularization
classifier (LogReg), we use the single-core implementation
provided with [26]2. For experiments with GBTs, we used
the open source multi-core implementation described in [8]3.
We modified the CNN source code from [10]4 to use the
TensorFlow framework [22] and ran it in multi-core mode within a single server with multi-GPU support. For our
experiments, we have made use of two types of machines:
1) c4.8xlarge instances in Amazon EC2.5 Each instance
consists of 36 virtual CPUs (vCPUs) with 60 GB memory.
Each physical CPU is a 2.6 GHz Intel Xeon E5-2666 v3
(Haswell) processor and is hyper-threaded to yield two
vCPUs.
2) Custom GPU server with 8 NVidia Titan X GPUs.6
Additionally, the server has two 3.40 GHz Intel Xeon
CPUs to communicate with the eight GPUs.
We observe from Table III that the single-core implementation of the LogReg classifier is the fastest to train and predict using just one CPU core. This is expected of a linear model. On average, the training time is 0.8, 3.0, and 26 ms per instance for LogReg, CNN, and GBT respectively. The prediction time is 0.07, 0.025 and 0.2 ms per instance for LogReg, CNN and GBT respectively. However, the prediction times for CNN and GBTs reported in Table III are for batch predictions only.
The GBT implementation described in [8] does not scale well
during online prediction without replicating the models in
memory. The prediction throughput drops to 20–40 classifications per second when the models are large, such as
those consisting of several hundred trees and occupy more
2 http://www.csie.ntu.edu.tw/~cjlin/liblinear/
3 https://github.com/dmlc/xgboost
4 https://github.com/yoonkim/CNN_sentence
5 https://aws.amazon.com/ec2/instance-types/
6 http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-titan-x/specifications
Training
Classifier   Time per instance (ms)   Percentage decrement w.r.t. GBT
GBT          26                       -
CNN          3.0                      88.5%
LogReg L1    0.8                      97.0%

Prediction
Classifier   Time per instance (ms)   Percentage decrement w.r.t. GBT
GBT          0.2                      -
CNN          0.025                    87.5%
LogReg L1    0.07                     65.0%
TABLE III: Average training and prediction runtime in ms
per instance for three best classifiers on the BU2 dataset. The
figures are averaged over those obtained for only the fifteen
top-level subtree categories.
Classifier   Average micro-precision   Absolute percentage increment w.r.t. LogReg L1
GBT          90.47%                    5.33%
CNN          88.99%                    3.85%
LogReg L1    85.14%                    -
TABLE IV: Average micro-precision improvements for three
best classifiers on the BU2 dataset. The figures are averaged
over those obtained for only the fifteen top-level subtree
categories.
than 500 MB to over 1 GB of disc space. Such large models
are common for production-grade e-commerce catalogs. For
instance, the ones we experiment with in this paper for BU2
dataset have the maximal tree depth set to 500 as dictated by
cross-validation experiments.
We observe from Table III that the average reduction in training runtime relative to the 36-vCPU GBT implementation is 97% for the single-core LogReg model and 88.5% for the 8-GPU CNN. The reduction in batch
prediction times are similar given the GBT reference point,
with CNNs outperforming LogReg due to native processor
architecture support for fast vector space operations.
The trade-off between runtime and accuracy is the key factor
to consider while deploying models to a production system. In
our environment, both training and prediction is done offline
(not-realtime) whereby in the latter phase, new product listings
are augmented with the predicted taxonomy branches before
they are indexed by the search engine. The “offline” prediction
is handled by a parallel scale-up of classification API servers
which communicate to a data pipeline system implemented in
Spark. The main bottleneck for runtime in our classification scheme arises from a very large top-level GBT model. It is
possible to use cheaper classifiers for some category subtrees
(see Figs. 4, 9 and 10), however, for production systems in-
volving millions of listings being classified every day, superior
classification performance in terms of mean precision and F1,
usually takes precedence given available resources. As such,
the numbers from Tables III and IV provide the management
with enough clues for cost-benefit analysis on which modeling
technique and corresponding hardware to choose.
VI. CONCLUDING REMARKS
Large-scale taxonomy categorization is a challenging task
with noisy and imbalanced data. We demonstrate deep learning
and gradient tree boosting models with operational robustness
in real industrial settings on e-commerce catalogs reaching
millions of items. We summarize our contributions as fol-
lows: 1) We study the effectiveness of large-scale product
taxonomy categorization on top-level category subtrees and
conclude that GBTs and CNNs can be used as new state-of-
the-art baselines; 2) We quantify the nature of imbalance for
different product datasets in terms of distributional divergence
and correlate that to prediction performance; 3) Using the
correlation of divergence measures for data imbalance and
prediction accuracies, we can estimate expected performance
even before training an expensive classifier; 4) We also show
evidence to suggest that words from product titles, after
removal of stop words and rare words, together with leaf
nodes from navigational breadcrumbs and list prices, when
available, can boost categorization performance significantly
on product datasets. Finally, 5) we showcase a novel use
of topic models with minimal human intervention to clean
large amounts of noise particularly when the source of noise
cannot be controlled. This is unlike any experiment reported
in previous publications on product categorization. Automatic
topic labeling with a pretrained classifier for a given category
from another dataset can help construct an initial taxonomy
over listings for which none exist – a major benefit for
reducing manual efforts for initial taxonomy creation.
REFERENCES
[1] V. Ganti, A. C. Konig, and X. Li, “Precomputing search features for fast and accurate query classification,” in Proceedings of the Third ACM International Conference on Web Search and Data Mining, ser. WSDM ’10, 2010.
[2] C. Sun, N. Rampalli, F. Yang, and A. Doan, “Chimera: Large-scale classification using machine learning, rules, and crowdsourcing,” Proc. VLDB Endow., vol. 7, no. 13, Aug. 2014.
[3] D. Shen, J.-D. Ruvini, and B. Sarwar, “Large-scale item categorization for e-commerce,” in Proceedings of the 21st ACM International Conference on Information and Knowledge Management, ser. CIKM ’12. New York, NY, USA: ACM, 2012, pp. 595–604.
[4] H. Pyo, J.-W. Ha, and J. Kim, “Large-scale item categorization in e-commerce using multiple recurrent neural networks,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’16. New York, NY, USA: ACM, 2016.
[5] J. McAuley, R. Pandey, and J. Leskovec, “Inferring networks of substitutable and complementary products,” in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’15. New York, NY, USA: ACM, 2015, pp. 785–794.
[6] D. Shen, J. D. Ruvini, M. Somaiya, and N. Sundaresan, “Item categorization in the e-commerce domain,” in Proceedings of the 20th ACM International Conference on Information and Knowledge Management, ser. CIKM ’11. New York, NY, USA: ACM, 2011, pp. 1921–1924.
[7] J. H. Friedman, “Greedy function approximation: A gradient boosting machine,” Annals of Statistics, vol. 29, pp. 1189–1232, 2000.
[8] T. Chen and C. Guestrin, “XGBoost: A scalable tree boosting system,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’16. New York, NY, USA: ACM, 2016, pp. 785–794. [Online]. Available: http://doi.acm.org/10.1145/2939672.2939785
[9] Y. LeCun and Y. Bengio, “Convolutional networks for images, speech, and time-series,” in The Handbook of Brain Theory and Neural Networks, M. A. Arbib, Ed. MIT Press, 1995.
[10] Y. Kim, “Convolutional neural networks for sentence classification,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, October 2014, pp. 1746–1751.
[11] H.-F. Yu, C.-H. Ho, P. Arunachalam, M. Somaiya, and C.-J. Lin, “Product title classification versus text classification,” UTexas, Austin; NTU; EBay, Tech. Rep., 2013.
[12] R. Bekkerman and M. Gavish, “High-precision phrase-based document classification on a modern scale,” in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’11. New York, NY, USA: ACM, 2011, pp. 231–239.
[13] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, Aug. 2003.
[14] T. Zhang, “Solving large scale linear prediction problems using stochastic gradient descent algorithms,” in ICML 2004: Proceedings of the 21st International Conference on Machine Learning, 2004, pp. 919–926.
[15] H. Zou and T. Hastie, “Regularization and variable selection via the elastic net,” Journal of the Royal Statistical Society, Series B, vol. 67, pp. 301–320, 2005.
[16] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in Neural Information Processing Systems 26 (NIPS 2013), 2013, pp. 3111–3119.
[17] D. Shen, J.-D. Ruvini, R. Mukherjee, and N. Sundaresan, “A study of smoothing algorithms for item categorization on e-commerce sites,” Neurocomput., vol. 92, pp. 54–60, Sep. 2012.
[18] C. Cortes and V. Vapnik, “Support-vector networks,” Mach. Learn., vol. 20, no. 3, pp. 273–297, Sep. 1995.
[19] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. New York, NY, USA: Cambridge University Press, 2008.
[20] D. M. Blei, A. Y. Ng, M. I. Jordan, and J. Lafferty, “Latent Dirichlet allocation,” Journal of Machine Learning Research, vol. 3, 2003.
[21] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York, NY, USA: Wiley-Interscience, 1991.
[22] M. Abadi et al., “TensorFlow: Large-scale machine learning on heterogeneous distributed systems,” 2015. [Online]. Available: http://download.tensorflow.org/paper/whitepaper2015.pdf
[23] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS 2010). Society for Artificial Intelligence and Statistics, 2010.
[24] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014. [Online]. Available: http://arxiv.org/abs/1412.6980
[25] A. Ng and M. Jordan, “On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes,” in Advances in Neural Information Processing Systems (NIPS), vol. 14, 2001.
[26] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, “LIBLINEAR: A library for large linear classification,” J. Mach. Learn. Res., vol. 9, pp. 1871–1874, Jun. 2008.
[27] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky, “The Stanford CoreNLP natural language processing toolkit,” in Association for Computational Linguistics (ACL) System Demonstrations, 2014, pp. 55–60.
[28] D. P. Putthividhya and J. Hu, “Bootstrapped named entity recognition for product attribute extraction,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, ser. EMNLP ’11. Stroudsburg, PA, USA: Association for Computational Linguistics, 2011, pp. 1557–1567.
[29] P. Das, R. Srihari, and Y. Fu, “Simultaneous joint and conditional modeling of documents tagged from two perspectives,” in Proceedings of the 20th ACM International Conference on Information and Knowledge Management, ser. CIKM ’11. New York, NY, USA: ACM, 2011, pp. 1353–1362.
[30] H.-F. Yu, C.-H. Ho, Y.-C. Juan, and C.-J. Lin, “LibShortText: A library for short-text classification and analysis,” Department of Computer Science, National Taiwan University, Taipei 106, Taiwan, Tech. Rep., 2013.
[31] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic minority over-sampling technique,” J. Artif. Int. Res., vol. 16, no. 1, pp. 321–357, Jun. 2002.