2016 IEEE International Conference on Big Data (Big Data)
Large-Scale Taxonomy Categorization for Noisy Product Listings
Pradipto Das, Yandi Xia, Aaron Levine, Giuseppe Di Fabbrizio, and Ankur Datta
Rakuten Institute of Technology, Boston
Rakuten USA
{pradipto.das,ts-yandi.xia,aaron.levine,giuseppe.difabbrizio,ankur.datta}@rakuten.com
Abstract—E-commerce catalogs include a continuously growing number of products that are constantly updated. Each item in a catalog is characterized by several attributes and identified by a taxonomy label. Categorizing products with their taxonomy labels is fundamental to effectively search and organize listings in a catalog. However, manual and/or rule-based approaches to categorization are not scalable. In this paper, we compare several classifiers for product taxonomy categorization of top-level categories. We first investigate a number of feature sets and observe that a combination of word unigrams from product names and navigational breadcrumbs works best for categorization. Secondly, we apply correspondence topic models to detect noisy data and introduce a lightweight manual process to improve dataset quality. Finally, we evaluate linear models, gradient boosted trees (GBTs) and convolutional neural networks (CNNs) with pre-trained word embeddings, demonstrating that, compared to other baselines, GBTs and CNNs yield the highest gains in error reduction.
I. INTRODUCTION
Product taxonomy categorization is a key factor in exposing
products of merchants to potential online buyers. Most catalog
search engines use taxonomy labels to optimize query results
and match relevant listings to users’ preferences [1]. To illus-
trate the general concept, consider Fig. 1. Nordstrom pushes
new shoe listings to an online catalog infrastructure, which
then organizes the listings into a taxonomy tree. When a user
searches for the comfort shoes “Lyza Clog”, the search engine
first has to understand that the user is searching for a “comfort shoes” category. Then, if the specific shoe product cannot
be found in the inventory, relevant shoes in the “Comfort” category are shown in the search results to encourage the
user to browse further. In addition to improving the quality
of product search, good categorization also plays a critical
role in targeted advertising, personalized recommendations
and product clustering. However, there are multiple challenges
in achieving good product categorization.
Commercial product taxonomies are organized in tree struc-
tures, three to ten levels deep, with thousands of leaf nodes
[2, 3, 4, 5]. The breadth of such taxonomies leads to unavoidable human errors while entering data. Even a site with a unified taxonomy where merchants directly upload listings
reported a 15% error rate in categorization [3]. For example,
in Fig. 1, a merchant has selected a wrong label, “Sneakers”,
for “1883 by Wolverine Women’s Maisie Oxford Tan/Taupe
Leather/Suede”, which contributes to labeling noise in training
data. Further, most e-commerce companies receive millions of
new listings per month from hundreds of merchants composed
of wildly different formats, descriptions, prices and meta-data
for the same products. For instance, even an identical product
can be listed as “Wilson tennis racket Level III signed by Federer” for $80 by one merchant and “Wilson tennis racquet Level 3 Roger Federer TM” for $72 by another. A robust
classifier has to assign the same taxonomy label to such listings
(e.g., “Sports & Outdoors > Fitness > Tennis > Rackets”)
regardless of these issues.
Current e-commerce systems trade off between accuracies
obtained by classifying a listing directly into thousands of leaf
node categories [2, 4] and splitting the taxonomy at predefined
depths [6, 3] in favor of building smaller subtree models. For
the latter scenario, there is usually another trade-off between
the number of hierarchical subtrees, the runtime latency intro-
duced by multiple categorization over the cascading subtrees
and the snowballing of error propagated in the prediction
chain. In this paper, similarly to [3], we classify product
listings in two steps: first we predict the top-level category and
then classify the listings using the subtree models selected by
the top level predictions. We use a publicly available Amazon
dataset [5] and two in-house datasets1 for our large-scale
taxonomy categorization experiments on product listings.
This paper makes several contributions to address such
large, noisy product datasets: 1) We perform large scale com-
parisons with several robust classification methods and show
that Gradient Boosted Trees (GBTs) [7, 8] and Convolutional
Neural Networks (CNNs) [9, 10] perform substantially better
than state-of-the-art linear models using word unigram features
(Section V). We further provide analysis of their performance
w.r.t. imbalance in product datasets. 2) We demonstrate that
using both listing price and navigational breadcrumbs – the
branches that merchants assign to the listings in web pages
for navigational purposes – boost categorization performance
(Section V-C) and 3) We effectively apply correspondence
topic models to detect and remove mislabeled instances in
training data with minimal human intervention (Section V-D).
1 The in-house datasets are from Rakuten USA Inc., managed by Rakuten Ichiba, Japan’s largest e-commerce company.
Fig. 1: E-commerce platform using taxonomy categorization to understand query intent, match merchant listings to potential
buyers as well as to prevent buyers from navigating away on search misses.
II. RELATED WORK
Our problem domain is similar to the one explored in
[2, 6, 3, 11, 4, 12], but with more pronounced data quality
issues. The work in [2] emphasizes the use of simple classifiers
in combination with large-scale manual efforts to reduce noise
and imperfections from categorization outputs. While human
intervention is important, we show how unsupervised topic
models can substantially reduce such expensive efforts for
product listings crawled in the wild. Further, [2] sheds no
light on the contribution of each classifier to the categorization
problem. In response, we adopted stronger baseline systems
based on regularized linear models [13, 14, 15].
A recent work from [4] emphasizes the use of recurrent neu-
ral networks for taxonomy categorization purposes. Although
they mention that RNNs render unlabeled pre-training of word
vectors [16] unnecessary, their embedding only depends on a
recursive representation of the same initial feature space. In
contrast, we show that training word embeddings on the whole
set of three product title corpora improves performance for
CNN models and opens up the possibility of leveraging other
product corpora when available. However, unlike the recent
claims in [2] and [17], both [4] and our experiments show the ineffectiveness of the Naïve Bayes classifiers.
Finally, [3] advocate the use of algorithmic splitting of
the taxonomy structure using graph theoretic latent group
discovery to mitigate data imbalance problems at the leaf
nodes. They use a combination of k-NN classifiers at the
coarser level and SVMs [18] classifiers at the leaf levels. Their
SVMs solve much easier k-way multi-class categorization
problems where k ∈ {3, 4, 5} with much less data imbalance.
We, however, have found that SVMs do not work well in
scenarios where k is large and the data is imbalanced. Due to
our high dimensional feature spaces, we avoided k-NN classi-
fiers that can cause prohibitively long prediction times under
arbitrary feature transformations [19]. Further, we make use
of probabilistic latent topics to eliminate mislabeled listings
in a novel way that would otherwise render the latent group
discovery in [3] ineffective on our in-house datasets.
III. DATASET CHARACTERISTICS
We use two in-house datasets, named BU1 and BU2, and
one publicly available Amazon dataset (AMZ) [5] for the
experiments in this paper. BU1 is categorized using human
annotation efforts and rule-based automated systems. This
leads to a high precision training set at the expense of
coverage. On the other hand, for BU2, noisy taxonomy labels
from external data vendors have been automatically mapped
to an in-house taxonomy without any human error correction,
resulting in a larger dataset at the cost of precision. BU2 also
suffers from inconsistencies in product titles and meta-data that
are incomplete or malformed due to errors in the web crawlers
that vendors use to aggregate new listings. In our experiments
on the BU2 dataset, the noise is distributed identically in the
training and test sets, thus evaluation of the classifiers is not
impeded by it.
Datasets Subtrees Branches Listings PCC KL
BU1 16 1,146 12.1M 0.643 0.872
BU2 15 571 60M 0.209 0.715
AMZ 25 18,188 7.46M 0.269 1.654
TABLE I: Dataset properties: total number of top-level category subtrees, branches and listings, along with the PCC and mean KL divergence imbalance measures described in the text.
The BU2 dataset has the noisiest ground-truth labels where
product listings have been assigned to incorrect labels. How-
ever, as the manual verification of millions of listings is
infeasible, using some proxy for ground truth is a viable
alternative that has previously produced encouraging results
[3]. To mitigate manual corrective efforts, we resort to using
topic models to detect incorrect listings (Section V-D). Topic
models such as LDA [20] have the advantage of finding latent
structure in data without the use of labels. In addition to
noisy ground truth labels, product taxonomies often exhibit
imbalanced data distribution.
Fig. 2: Imbalance in evaluation datasets for (a) BU1, (b) AMZ (vertical scale of the top panel is 10x that of Figs. 2a and 2c), and (c) BU2. The top row shows the number of branches with a non-zero number of listings. The bottom row shows the KL divergence of the empirical distribution of listings in each subtree compared to the uniform distribution.
Fig. 3: Top-level category distribution of 40 million deduplicated listings from the Dec 2015 snapshot of BU2. Each category subtree is also imbalanced, as seen in the exploded view of the “Furniture” category.
Figure 3 shows the top-level “Home, Furniture and Patio” subtree that accounts for almost half of the BU2 dataset. The
first row in Fig. 2 shows the number of branches for top-
level taxonomy subtrees for three datasets. To measure the
level of data imbalance, we calculated the Pearson correlation
coefficient (PCC) between the number of listings and branches
in each of the top-level subtrees (Table I). A perfectly balanced
tree will have a PCC = 1.0. BU1 (Fig. 2a) shows the most
benign kind of imbalance with a PCC = 0.643. This confirms
that the number of branches in the subtrees correlates well
with the volume of listings. AMZ shows the highest average
number of branches in the subtrees. This is likely due to the
crawling technique used to harvest the data, which we believe
differs from Amazon’s internal catalog. For AMZ and BU2
in particular, the number of branches in the subtrees does not
correlate well with the volume of listings, indicating a much
higher level of imbalance.
The bottom row in Fig. 2 shows the average Kullback-
Leibler (KL) divergence, KL(p(x) ‖ q(x)) [21], between the
empirical distribution over listings in branches for each sub-
tree, p(x), compared to a uniform distribution, q(x). Here, the
KL divergence acts as a measure of imbalance of the listing
distribution and is indicative of the categorization performance
that one may obtain on a dataset; high KL divergence leads
to poorer categorization and vice-versa (see Section V).
IV. GRADIENT BOOSTED TREES AND CONVOLUTIONAL
NEURAL NETWORKS
We observed in our experiments that the strongest performing models were GBTs [7] and CNNs [9, 10]. GBTs optimize a loss functional $\mathcal{L} = E_y[L(y, F(\mathbf{x})) \mid \mathbf{X}]$, where $F(\mathbf{x})$ can be a mathematically difficult to characterize function, such as a decision tree $f(\mathbf{x})$ on $\mathbf{X}$. The optimal value of the function is expressed as
$$F^{*}(\mathbf{x}) = \sum_{m=0}^{M} f_m(\mathbf{x}, \mathbf{a}_m) \quad (1)$$
where $f_0(\mathbf{x}, \mathbf{a}_0)$ is the initial guess and $\{f_m(\mathbf{x}, \mathbf{a}_m)\}_{m=1}^{M}$ are incremental “boosts” on $\mathbf{x}$ defined by the optimization method, with $\mathbf{a}_m$ as the parameters of $f_m$. To compute $F^{*}(\mathbf{x})$, we also need to calculate
$$\mathbf{a}_m = \arg\min_{\mathbf{a}} \sum_{i=1}^{N} L(y_i, F_m(\mathbf{x}_i)) \quad (2)$$
with
$$F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + f_m(\mathbf{x}, \mathbf{a}_m). \quad (3)$$
This expression looks very similar to
$$F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \rho_m \left( -\left[ \frac{\partial L(y, F(\mathbf{x}))}{\partial F(\mathbf{x})} \right]_{F(\mathbf{x}) = F_{m-1}(\mathbf{x})} \right) \quad (4)$$
where $\rho_m$ is the step length and
$$\left[ \frac{\partial L(y, F(\mathbf{x}))}{\partial F(\mathbf{x})} \right]_{F(\mathbf{x}_i) = F_{m-1}(\mathbf{x}_i)} = g_m(\mathbf{x}_i) \quad (5)$$
is the search direction. To solve for $\mathbf{a}_m$, we make the basis functions $f_m(\mathbf{x}_i; \mathbf{a})$ correlate most with
$$-\left[ \frac{\partial L(y_i, F(\mathbf{x}_i))}{\partial F(\mathbf{x}_i)} \right]_{F(\mathbf{x}_i) = F_{m-1}(\mathbf{x}_i)} \quad (6)$$
where the gradients are constrained to be defined over the training data distribution. It can be shown that
$$\mathbf{a}_m = \arg\min_{\mathbf{a}} \sum_{i=1}^{N} \left( -g_m(\mathbf{x}_i) - \rho f_m(\mathbf{x}_i, \mathbf{a}) \right)^2 \quad (7)$$
and
$$\rho_m = \arg\min_{\rho} \sum_{i=1}^{N} L\big(y_i, F_{m-1}(\mathbf{x}_i) + \rho f_m(\mathbf{x}_i; \mathbf{a}_m)\big) \quad (8)$$
which yields
$$F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \rho_m f_m(\mathbf{x}, \mathbf{a}_m). \quad (9)$$
Additional details are described in [7].
In GBTs, the optimal selection of decision variables, value splits, and means at leaf nodes (i.e., $\mathbf{a}_m$) is based on optimizing $f_m(\mathbf{x}, \mathbf{a})$ using a logistic loss. Each decision tree in a GBT is robust to imbalanced data and outliers [13], and $F(\mathbf{x})$ can approximate arbitrarily complex decision boundaries in a piecewise manner.
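As a point of reference, a rough sketch of such a boosted-tree setup with the open-source XGBoost implementation cited as [8] is shown below; the toy titles, labels, and hyper-parameter values are illustrative assumptions, not the paper's exact configuration (the paper reports trees up to depth 500):

```python
from sklearn.feature_extraction.text import CountVectorizer
from xgboost import XGBClassifier

# Toy listings and top-level category ids (placeholders).
titles = ["wilson tennis racquet level 3 roger federer",
          "1883 by wolverine women s maisie oxford tan leather",
          "wilson tennis racket level iii signed by federer",
          "stretch criss cross strap sandals"]
labels = [0, 1, 0, 1]

vectorizer = CountVectorizer()             # word unigram counts (cf. Section V-B)
X = vectorizer.fit_transform(titles)

gbt = XGBClassifier(max_depth=6,           # assumed; far shallower than the paper's trees
                    n_estimators=200,
                    learning_rate=0.1)
gbt.fit(X, labels)
print(gbt.predict(vectorizer.transform(["roger federer tennis racquet"])))
```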
The neural network we use is based on the CNN architecture described in [10] using the TensorFlow framework [22]. As in [10], we enhance the performance of “vanilla” CNNs (Fig. 4 right) using word embedding vectors [16] trained on the product titles from all datasets, without taxonomy labels. Context windows of width n, corresponding to n-grams embedded in a 300-dimensional word embedding space, are convolved with L filters, followed by rectified linear unit activation and a max-pooling operation over the set of all windows W. This operation results in an L × 1 vector, which is then connected to a softmax output layer of dimension K × 1, where K is the number of classes. The filters are initialized with Xavier initialization [23]. We use mini-batch stochastic gradient descent with the Adam optimizer [24] to perform parameter optimization.
The CNN model tries to allocate as few filters as possible to the context windows while balancing the constraints on the back-propagation of error residuals w.r.t. the cross-entropy loss $L = -\sum_{k=1}^{K} q_k \log p_k$, where $p_k$ is the probability of a product title x belonging to class k predicted by our model, and $q \in \{0, 1\}^K$ is a one-hot vector that represents the true label of title x. This results in a higher predictive power for the CNNs, while still matching complex decision boundaries in a smoother fashion than GBTs.
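A rough tf.keras sketch of this architecture follows; it is not the authors' original TensorFlow code, and the vocabulary size, maximum title length, and filter count are assumptions:

```python
import tensorflow as tf

MAX_LEN, VOCAB, EMB_DIM, NUM_FILTERS, K = 30, 50000, 300, 128, 15  # assumed sizes

tokens = tf.keras.Input(shape=(MAX_LEN,), dtype="int32")
emb = tf.keras.layers.Embedding(VOCAB, EMB_DIM)(tokens)  # can be seeded with word2vec vectors [16]
pooled = []
for width in (1, 3, 4, 5):                                # context window widths from Section V
    conv = tf.keras.layers.Conv1D(NUM_FILTERS, width, activation="relu")(emb)
    pooled.append(tf.keras.layers.GlobalMaxPooling1D()(conv))       # max-pool over all windows W
features = tf.keras.layers.Concatenate()(pooled)          # concatenated filter responses (the L x 1 vector)
outputs = tf.keras.layers.Dense(K, activation="softmax")(features)  # K x 1 softmax output layer

model = tf.keras.Model(tokens, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(),       # Adam [24]
              loss="sparse_categorical_crossentropy")     # the cross-entropy loss L above
```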
V. EXPERIMENTAL SETUP AND RESULTS
We use Naïve Bayes (NB) [25], similar to the approach
described in [17] and [2], and Logistic Regression (LogReg)
classifiers with L1 [26] and Elastic Net regularization, as
robust baselines.
The objective functions of both GBTs and CNNs involve L2
regularizers over the set of parameters. Our development set
for parameter tuning is generated by randomly selecting 10%
of the datasets for listings under the “apparel / clothing” cat-
egories. The optimized parameters obtained from this scaled-down configuration are extended to all other classifiers as well
to reduce experimentation time for more time-consuming clas-
sification models such as GBTs and CNNs. After parameter
tuning, we set a linear combination of 15% L1 regularization
and 85% L2 regularization for Elastic Net. For GBTs, we
limit each decision tree growth to a maximum depth of 500.
For CNNs, we use context window widths of sizes 1, 3, 4, 5 for
four convolution filters, a batch size of 1024 and an embedding
dimension of 300. The parameters for the embeddings are non-
static. LogReg classifiers and CNN need data to be normalized
along each dimension, which is not needed for NB and GBT.
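For reference, a minimal scikit-learn construction of the Elastic Net logistic-regression baseline with the 15% L1 / 85% L2 mix mentioned above might look as follows; the paper's experiments used LIBLINEAR/LibShortText-style tooling, so the solver and regularization strength here are assumptions:

```python
from sklearn.linear_model import LogisticRegression

elastic_net_logreg = LogisticRegression(penalty="elasticnet",
                                        solver="saga",    # solver that supports elastic net
                                        l1_ratio=0.15,    # 15% L1, 85% L2
                                        C=1.0,            # assumed regularization strength
                                        max_iter=1000)
```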
A. Data Preprocessing
BU1 is exclusively comprised of product titles, hence, our
features are primarily extracted from these titles. For AMZ
and BU2, we additionally extract the list price whenever
available. For BU2, we also use the leaf node of any available
navigational breadcrumbs. In order to decrease training and
categorization run times, we employ a number of vocabulary
filtering methods. Further, English stop-words and rare tokens
that appear in ten listings or fewer are then filtered out. This
reduces vocabulary sizes by up to fifty percent, without a
significant reduction in categorization performance. For CNNs,
we replace numbers with the nominal form [NUM] and remove rare tokens. We also remove punctuation and lowercase all text. Parts of speech (POS) tagging using a generic tagger
from [27] trained on English text produced very noisy features,
as is expected for out-of-domain tagging. Consequently, we do
not use these features due to the absence of a POS training
set for listings as in [28]. We also experimented with word
embedding features from Word2Vec model [16], for instance,
to constrain variants of words like “t-shirts” in one embedding
dimension, however, the overall results have not been better.
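The normalization steps above can be summarized in a small sketch; the stop-word list and rare-token set below are placeholders (the actual filter drops tokens seen in ten or fewer listings):

```python
import re

STOPWORDS = {"the", "a", "an", "and", "for", "of", "with"}   # stand-in for a full English list

def preprocess_title(title, rare_tokens=frozenset(), for_cnn=False):
    """Lowercase, strip punctuation, optionally map numbers to [NUM] (CNN input only),
    then drop stop words and rare tokens, roughly following Section V-A."""
    text = re.sub(r"[^\w\s]", " ", title.lower())             # remove punctuation, lowercase
    tokens = text.split()
    if for_cnn:
        tokens = ["[NUM]" if tok.isdigit() else tok for tok in tokens]
    return [tok for tok in tokens if tok not in STOPWORDS and tok not in rare_tokens]

print(preprocess_title("Wilson tennis racquet Level 3, Roger Federer TM", for_cnn=True))
```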
B. Initial Experiments on BU1 dataset
Our initial experiments use unigram counts and three other
indicators – word bigram counts, bi-positional unigram counts,
and bi-positional bigram counts. Consider a title text “120 gb hdd 5400rpm sata fdb 2 5 mobile” from the “Data storage” leaf node of the Electronics taxonomy subtree and another title
Fig. 4: Classifier performance on BU1 test set. The CNN classifier has only one configuration and thus shows constant curves
in all plots. Left figure shows prediction on 10% test set using word unigram count features; middle figure shows prediction
on 10% test set using word bigram bi-positional count features; and the right figure shows mean micro-precision over different
feature setups except CNNs. In all figures, “OvO” means “One vs. One” and “OvA” means “One vs All”.
text “acer aspire v7 582pg 6421 touchscreen ultrabook 15 6 full hd intel i5 4200u 8gb ram 120 gb hdd ssd nvidia geforce gt 720m” from the “Laptops and netbooks” leaf node. In such
cases, we observe that merchants tend to place terms pertaining
to storage device specifics in the front of product titles for
“Data storage” and similar terms towards the end in the titles
for “Laptops”. As such, we split the length of the title in half
and augment word uni/bigrams with a left/right-half position.
This makes sense from a Naïve Bayes point of view, since signals like “120 gb”[Left Half], “gb hdd”[Left Half], “120 gb”[Right Half] and “gb hdd”[Right Half] decorrelate the feature space better, which suits the naïve assumption in NB. This also helps to explain the class posteriors slightly better. These assumptions for NB are validated in Fig. 4 left, Fig. 4 middle and Fig. 4 right. Word unigram count features perform
strongly for all classifiers except NB, whereas bi-positional
word bigram features helped only NB significantly.
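To make the bi-positional features concrete, here is a small sketch of the tagging scheme (the exact feature string format is our own choice):

```python
def bipositional_features(tokens, bigrams=True):
    """Tag word uni/bi-grams with the half of the title in which they occur,
    e.g. "120 gb"[Left Half] for a data-storage title vs. [Right Half] for a laptop title."""
    half = len(tokens) / 2.0
    feats = []
    for i, tok in enumerate(tokens):
        side = "Left Half" if i < half else "Right Half"
        feats.append(f"{tok}[{side}]")
        if bigrams and i + 1 < len(tokens):
            feats.append(f"{tok} {tokens[i + 1]}[{side}]")
    return feats

print(bipositional_features("120 gb hdd 5400rpm sata fdb 2 5 mobile".split()))
```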
Additionally, the micro-precision and F1 scores for CNNs
and GBTs are significantly higher compared to other algo-
rithms on word unigrams using a paired t-test with a p-value < 0.0001. The performances of GBTs and LogReg L1 classifiers
deteriorate over the other feature sets as well. The bi-positional
and bigram feature sets also do not produce any improvements
for the AMZ dataset. Based on these initial experiments, we
focus on word unigrams in all of our subsequent experiments.
C. Categorization Improvements with Navigational Bread-crumbs and List Prices on BU2 Dataset
BU2 is a challenging dataset in terms of class imbalance and
noise and we sought to improve categorization performance
using available meta-data. To start, we experiment with a
smaller dataset consisting of ≈ 500,000 deduplicated listings
under the “Women’s Clothing” taxonomy subtree, extracted
from our Dec 2015 snapshot of 40 million records. We then
test against ≈ 2.85 million deduplicated “Women’s Clothing” listings from the Feb 2016 BU2 dataset. In all experiments,
10% of the data is used as test set.
The first noteworthy fact in Fig. 5 is that the micro-precision
and F1 of the GBTs substantially improve after increasing the
size of the dataset. Further, stop words and rare words filtering
decrease precision and F1 by less than 1%, despite halving the
feature space. The addition of navigational leaf nodes and list
prices prove advantageous, with both features independently
boosting performance and raising micro-precision and F1 to
over 90%. Despite finding similar gains in categorization
performance for other top-level subtrees by using these meta
features, we needed a system to filter mis-categorized listings
from our training data as well.
Fig. 5: Improvements in micro-precision and F1 for GBTs on
BU2 dataset for “Women’s Clothing” subtree
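One straightforward way to realize these meta-data features is to append the breadcrumb leaf text and a log-scaled list price to the title unigram representation; the sketch below reflects this reading and is not necessarily the paper's exact encoding:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

def title_plus_metadata(titles, breadcrumb_leaves, prices):
    """Stack title unigram counts, breadcrumb-leaf unigram counts, and a
    log-scaled list-price column into one sparse feature matrix."""
    vec = CountVectorizer()
    X_title = vec.fit_transform(titles)
    X_crumb = vec.transform(breadcrumb_leaves)   # assumption: breadcrumbs share the title vocabulary
    X_price = csr_matrix(np.log1p(np.asarray(prices, dtype=float)).reshape(-1, 1))
    return hstack([X_title, X_crumb, X_price]).tocsr()
```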
3890
Fig. 6: Top few most probable words under the latent “noise” topics over listings in the “Shoes” subtree. Human annotators inspect whether such sets of words belong to a Shoes topic or not.
Fig. 7: Interpretation of latent topics using predictions from a GBT classifier. The topics here exclude those in Fig. 6 and are all from the Feb 2016 snapshot of the BU2 dataset.
Fig. 8: Correspondence MMLDA model for text [29].
We also validated the fact that list prices and the leaf node
text from breadcrumbs improve classification performance by
observing the highest weighted features obtained from the
GBT model. For instance, of the top thousand features in the top-level classification model, seven percent are breadcrumbs. The list price feature is also ranked around the three hundredth place out of a total of around five hundred thousand features. Qualitative inspection
of the top one hundred features from the top-level model also
revealed an abundance of terms from the “Jewelry & Watches”
category. This observation also correlated well with the fact
that the test set performance for the top-level “Jewelry &
Watches” category has also been the highest, with a yield
of 98% precision and F1.
D. Noise Analysis of BU2 Dataset using Correspondence LDAfor Text
When categorization of listings in the “Shoes” subtree
performed over 25 points below “Women’s Clothing”, as seen
in Fig. 11, we theorized that the incorrect label assignments
are to blame. There are over 3.4 million “Shoes” listings in the BU2 dataset, and thus a manual scan is infeasible, unlike in [2].
To address this problem, we compute p(x) over latent topics
zk, and automatically annotate the most probable words over
each topic.
We choose the CorrMMLDA model (Fig. 8) from [29]
to discover the latent topical structure of the listings in the
Shoes category because of two reasons. Firstly, the model is
a natural choice for our scenario – it is intuitive to assume
that store and brand names (e.g. the store “QVC” and brand “Adam Tucker” – words w_{d,m} in the M plate of listing d in Fig. 8) are distributions over words in titles (e.g. the words in “Stretch criss cross strap sandals” – words w_{d,n} in the N plate of the same listing d in Fig. 8). The title words are in turn distributions over the latent topics z_d for listing d ∈ {1..D}. Secondly, the CorrMMLDA model has been
shown to exhibit lower held out perplexities indicative of
improved topic quality. The reason behind the lower perplexity
stems from the following observations:
Using the notation in [29], we denote the free parameters of the variational distributions over words in brand and store names, say λ_{d,m,n}, as Multinomials over words in the titles, and those over words in the title, say φ_{d,n,k}, as Multinomials over latent topics z_d. It is easy to see that the posterior over the topic z_{d,k} for each word w_{d,m} of brand and store names is dependent on λ and φ through $\sum_{n=1}^{N_d} \lambda_{d,m,n} \times \phi_{d,n,k}$. This means that if a certain topic z_d = j generates all words in the title, i.e., φ_{d,n,j} > 0, then only that topic also generates the brand and store names, thereby increasing likelihood of fit and reducing perplexity. The other topics z_d ≠ j do not contribute towards explaining the topical structure of the listing d. To this end, the correspondence view, w^N_d, is made up of word unigrams in the title, and the content view, w^M_d, consists of brand and merchant names, both as bags-of-words.
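As a toy numerical check of this statement (with random rather than fitted λ and φ values), the topic posterior for one brand/store word can be computed as:

```python
import numpy as np

rng = np.random.default_rng(0)
N_d, K = 6, 4                              # assumed title length and number of topics
phi = rng.dirichlet(np.ones(K), size=N_d)  # phi[n, k]: topic Multinomial of title word n
lam = rng.dirichlet(np.ones(N_d))          # lambda[m, n]: alignment of one brand/store word to title words

posterior = lam @ phi                      # sum_n lambda[m, n] * phi[n, k]
posterior /= posterior.sum()               # posterior over topics z_{d,k} for that word
print(posterior)
```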
We train the CorrMMLDA model with K = 100 latent
topics. A sample of nine latent topics and their most probable
words shown in Fig. 6 demonstrates topics outside of the
“Shoes” domain can be manually identified, while reducing
human annotation efforts from 3.4 million records to one
hundred. We choose K = 100 since it is roughly twice
the number of branches for the Shoes subtree. This choice
provides modeling flexibility while respecting the number of
ground truth classes.
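CorrMMLDA is not available in off-the-shelf libraries, so purely to illustrate the workflow (fit a topic model on title text, then read off the most probable words per topic for annotation), a plain LDA [20] stand-in with gensim could look like this:

```python
from gensim import corpora, models

# Tokenized "Shoes" titles (toy placeholders; in practice, millions of listings).
titles_tokens = [["stretch", "criss", "cross", "strap", "sandals"],
                 ["leather", "oxford", "women", "tan", "suede"]]

dictionary = corpora.Dictionary(titles_tokens)
corpus = [dictionary.doc2bow(toks) for toks in titles_tokens]

# K = 100 in the paper; a tiny value is used here only so the toy example runs.
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=5)
top_words = [[w for w, _ in lda.show_topic(k, topn=6)] for k in range(lda.num_topics)]
print(top_words)
```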
We next run a list of the most probable six words, the
average length of a “Shoes” listing’s title, from each latent
topic through another GBT classifier trained on the full, noisy
data for the “Shoes” category. The GBT classifier used for this
Fig. 9: Micro-precision on 10% test set from the
BU2 dataset across top-level categories.
Fig. 10: Micro-precision on 10% test set from the AMZ dataset
across top-level categories.
Fig. 11: Micro-precision and F1 across fifteen top-level
categories on 10% (4 million listings) of Dec 2015 BU2
snapshot.
purpose has been trained only on the words from the product
titles but without the list prices and the leaf node text from
the breadcrumbs. The reason is that the short descriptions of
the latent topics, which we want to classify, resemble short
bag-of-word queries without any metadata.
Categories that mismatch their topics are manually labeled
ambiguous, as seen in the bottom two rows in Fig. 7. As a
final confirmation, we uniformly sampled a hundred listings
from each topic judged to be ambiguous. Inspections revealed that numerous listings from merchants that do not sell shoes are cataloged in the “Shoes” subtree due to vendor error. To
this end, we remove listings corresponding to such “out-of-category” merchants from all top-level categories.
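The resulting screening loop can be sketched as follows; titles_only_gbt, vectorizer, and top_words are hypothetical names for the titles-only classifier, its feature extractor, and the per-topic word lists from above:

```python
# Interpret each latent topic by classifying its top-6-word pseudo-query with the
# titles-only GBT (cf. Fig. 7); annotators then mark mismatching topics as ambiguous
# and inspect a ~100-listing sample from each before removing offending merchants.
topic_interpretations = []
for k, words in enumerate(top_words):
    query = " ".join(words)                                   # short bag-of-words "query"
    pred = titles_only_gbt.predict(vectorizer.transform([query]))[0]
    topic_interpretations.append((k, words, pred))            # handed to human annotators
```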
To summarize, by manually inspecting the K × 6 most probable words from the K = 100 topics and J × 100 listings, where J ≪ K, instead of 3.4 million listings, a few annotators
accomplished in hours what would have taken hundreds of
annotators several months according to the estimates in [2].
E. Results on BU2 and AMZ Datasets
Here we first report categorization performance of our
LogReg L1 baseline on the earlier Dec 2015 snapshot of BU2
dataset using just the words in the titles excepting stop words.
In section V-B, we have shown the efficacy of word unigram
features on the BU1 dataset. Figure 11 shows that LogReg
with L1 regularization [11, 30] initially achieves 83% mean
micro-precision and F1 on the initial BU2 dataset. This had
fallen short of our expectation of achieving an overall 90%
precision (red line in Fig. 11), but forms a robust baseline for
our subsequent experiments with the AMZ and the cleaned
BU2 datasets. We additionally use list price and leaf nodes of
navigational breadcrumbs for BU2 dataset and list price for
AMZ dataset when available.
Overall, Naïve Bayes, being an overly simplified generative
model, generalizes very poorly on all datasets (see Figs. 4,
9 and 10). It may be possible to improve NB’s performance
using sub-sampling techniques [31]; however, sub-sampling
can have its own problems for product datasets [2].
We observe from Table II that when log(N/B), where N is the number of listings and B is the number of categories, is relatively high (11.54 for BU2 and 9.27 for BU1, compared to 6.02 for AMZ), most classifiers tend to perform well. Figures
with a ∗ are statistically better than other non-starred ones in
the same row except the last column. From Fig. 9, it is clear
that GBTs are better on BU2.
We also experiment with CNNs augmented to use meta-
data while respecting the convolutional constraints on title
text, however, the performance improved only marginally. It
is not immediately clear why all the classifiers suffer on the
“CDs and Vinyl” category, which has more than 500 branches
– see Fig. 10. The AMZ dataset also suffers from novel
cases of data imbalance. For instance, most of the listings
in “Books” and “Grocery” are in one branch, with most other
branches containing less than 10 listings. This might be due
Dataset NB LogReg ElasticNet LogReg L1 GBT CNN w/ pre-training Mean KL
BU1 81.45 86.30 86.75 89.03* 89.12* 0.872
BU2 68.21 84.29 85.01 90.63* 88.67 0.715
AMZ 49.01 69.39 66.65 67.17 72.66* 1.654
TABLE II: Mean micro-precision on 10% test set from BU1, BU2 and AMZ datasets
to the use of a presumed “navigational taxonomy” in the
crawled AMZ dataset [5]. However, we do not have Amazon’s
internal product taxonomy organization to validate the claim.
In summary, from both Figs. 9 and 10, we observe that GBTs
and CNNs with pre-training perform best even in extreme data
imbalance. It is possible that GBTs need finer parameter tuning
per top-level subtree for datasets resembling AMZ.
F. Hardware Setup and Empirical Runtime Analysis
Finally, we briefly touch upon the hardware setup and
the empirical runtime analysis of the three best classifiers.
We report training and prediction runtimes for the largest
dataset: BU2. For the logistic regression with L1 regularization
classifier (LogReg), we use the single-core implementation
provided with [26]2. For experiments with GBTs, we used
the open source multi-core implementation described in [8]3.
We modified the CNN source code from [10]4 to use the
TensorFlow framework [22] and ran it in multi-core mode within a single server with multi-GPU support. For our
experiments, we have made use of two types of machines:
1) c4.8xlarge instances in Amazon EC2.5 Each instance
consists of 36 virtual CPUs (vCPUs) with 60 GB memory.
Each physical CPU is a 2.6 GHz Intel Xeon E5-2666 v3
(Haswell) processor and is hyper-threaded to yield two
vCPUs.
2) Custom GPU server with 8 NVidia Titan X GPUs.6
Additionally, the server has two 3.40 GHz Intel Xeon
CPUs to communicate with the eight GPUs.
We observe from Table III that the single-core implementation of the LogReg classifier is the fastest to train and predict using just one CPU core. This is expected of a linear model. On average, the training time is 0.8, 3.0, and 26 ms per instance for LogReg, CNN, and GBT respectively. The prediction time is 0.07, 0.025 and 0.2 ms per instance for LogReg, CNN and GBT respectively. However, the prediction times for CNN and GBTs reported in Table III are for batch predictions only.
The GBT implementation described in [8] does not scale well
during online prediction without replicating the models in
memory. The prediction throughput drops to 20–40 classifications per second when the models are large, such as
those consisting of several hundred trees and occupy more
2 http://www.csie.ntu.edu.tw/~cjlin/liblinear/
3 https://github.com/dmlc/xgboost
4 https://github.com/yoonkim/CNN_sentence
5 https://aws.amazon.com/ec2/instance-types/
6 http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-titan-x/specifications
Training
Classifier   Time per instance (ms)   Percentage decrement w.r.t. GBT
GBT          26                       -
CNN          3.0                      88.5%
LogReg L1    0.8                      97.0%

Prediction
Classifier   Time per instance (ms)   Percentage decrement w.r.t. GBT
GBT          0.2                      -
CNN          0.025                    87.5%
LogReg L1    0.07                     65.0%
TABLE III: Average training and prediction runtime in ms
per instance for three best classifiers on the BU2 dataset. The
figures are averaged over those obtained for only the fifteen
top-level subtree categories.
Classifier   Average micro-precision   Absolute percentage increment w.r.t. LogReg L1
GBT          90.47%                    5.33%
CNN          88.99%                    3.85%
LogReg L1    85.14%                    -
TABLE IV: Average micro-precision improvements for three
best classifiers on the BU2 dataset. The figures are averaged
over those obtained for only the fifteen top-level subtree
categories.
than 500 MB to over 1 GB of disc space. Such large models
are common for production-grade e-commerce catalogs. For
instance, the ones we experiment with in this paper for BU2
dataset have the maximal tree depth set to 500 as dictated by
cross-validation experiments.
We observe from Table III that the average reduction in training runtime relative to the 36-vCPU GBT implementation is 97% for the single-core LogReg model and 88.5% for the 8-GPU CNN. The reduction in batch
prediction times are similar given the GBT reference point,
with CNNs outperforming LogReg due to native processor
architecture support for fast vector space operations.
The trade-off between runtime and accuracy is the key factor
to consider while deploying models to a production system. In
our environment, both training and prediction is done offline
(not-realtime) whereby in the latter phase, new product listings
are augmented with the predicted taxonomy branches before
they are indexed by the search engine. The “offline” prediction
is handled by a parallel scale-up of classification API servers
which communicate to a data pipeline system implemented in
Spark. The main bottleneck for runtime in our classification scheme arises from a very large top-level GBT model. It is
possible to use cheaper classifiers for some category subtrees
(see Figs. 4, 9 and 10), however, for production systems in-
volving millions of listings being classified every day, superior
classification performance in terms of mean precision and F1,
usually takes precedence given available resources. As such,
the numbers from Tables III and IV provide the management
with enough clues for cost-benefit analysis on which modeling
technique and corresponding hardware to choose.
VI. CONCLUDING REMARKS
Large-scale taxonomy categorization is a challenging task
with noisy and imbalanced data. We demonstrate deep learning
and gradient tree boosting models with operational robustness
in real industrial settings on e-commerce catalogs reaching
millions of items. We summarize our contributions as fol-
lows: 1) We study the effectiveness of large-scale product
taxonomy categorization on top-level category subtrees and
conclude that GBTs and CNNs can be used as new state-of-
the-art baselines; 2) We quantify the nature of imbalance for
different product datasets in terms of distributional divergence
and correlate that to prediction performance; 3) Using the
correlation of divergence measures for data imbalance and
prediction accuracies, we can estimate expected performance
even before training an expensive classifier; 4) We also show
evidence to suggest that words from product titles, after
removal of stop words and rare words, together with leaf
nodes from navigational breadcrumbs and list prices, when
available, can boost categorization performance significantly
on product datasets. Finally, 5) we showcase a novel use
of topic models with minimal human intervention to clean
large amounts of noise particularly when the source of noise
cannot be controlled. This is unlike any experiment reported
in previous publications on product categorization. Automatic
topic labeling with a pretrained classifier for a given category
from another dataset can help construct an initial taxonomy
over listings for which none exist – a major benefit for
reducing manual efforts for initial taxonomy creation.
REFERENCES
[1] V. Ganti, A. C. Konig, and X. Li, “Precomputing search features for fast and accurate query classification,” in Proceedings of the Third ACM International Conference on Web Search and Data Mining, ser. WSDM ’10, 2010.
[2] C. Sun, N. Rampalli, F. Yang, and A. Doan, “Chimera: Large-scale classification using machine learning, rules, and crowdsourcing,” Proc. VLDB Endow., vol. 7, no. 13, Aug. 2014.
[3] D. Shen, J.-D. Ruvini, and B. Sarwar, “Large-scale item categorization for e-commerce,” in Proceedings of the 21st ACM International Conference on Information and Knowledge Management, ser. CIKM ’12. New York, NY, USA: ACM, 2012, pp. 595–604.
[4] H. Pyo, J.-W. Ha, and J. Kim, “Large-scale item categorization in e-commerce using multiple recurrent neural networks,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’16. New York, NY, USA: ACM, 2016.
[5] J. McAuley, R. Pandey, and J. Leskovec, “Inferring networks of substitutable and complementary products,” in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’15. New York, NY, USA: ACM, 2015, pp. 785–794.
[6] D. Shen, J. D. Ruvini, M. Somaiya, and N. Sundaresan, “Item categorization in the e-commerce domain,” in Proceedings of the 20th ACM International Conference on Information and Knowledge Management, ser. CIKM ’11. New York, NY, USA: ACM, 2011, pp. 1921–1924.
[7] J. H. Friedman, “Greedy function approximation: A gradient boosting machine,” Annals of Statistics, vol. 29, pp. 1189–1232, 2000.
[8] T. Chen and C. Guestrin, “XGBoost: A scalable tree boosting system,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’16. New York, NY, USA: ACM, 2016, pp. 785–794. [Online]. Available: http://doi.acm.org/10.1145/2939672.2939785
[9] Y. LeCun and Y. Bengio, “Convolutional networks for images, speech, and time-series,” in The Handbook of Brain Theory and Neural Networks, M. A. Arbib, Ed. MIT Press, 1995.
[10] Y. Kim, “Convolutional neural networks for sentence classification,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, October 2014, pp. 1746–1751.
[11] H.-F. Yu, C.-H. Ho, P. Arunachalam, M. Somaiya, and C.-J. Lin, “Product title classification versus text classification,” UTexas, Austin; NTU; EBay, Tech. Rep., 2013.
[12] R. Bekkerman and M. Gavish, “High-precision phrase-based document classification on a modern scale,” in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’11. New York, NY, USA: ACM, 2011, pp. 231–239.
[13] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, Aug. 2003.
[14] T. Zhang, “Solving large scale linear prediction problems using stochastic gradient descent algorithms,” in ICML 2004: Proceedings of the 21st International Conference on Machine Learning, 2004, pp. 919–926.
[15] H. Zou and T. Hastie, “Regularization and variable selection via the elastic net,” Journal of the Royal Statistical Society, Series B, vol. 67, pp. 301–320, 2005.
[16] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in Neural Information Processing Systems 26 (NIPS 2013), 2013, pp. 3111–3119.
[17] D. Shen, J.-D. Ruvini, R. Mukherjee, and N. Sundaresan, “A study of smoothing algorithms for item categorization on e-commerce sites,” Neurocomput., vol. 92, pp. 54–60, Sep. 2012.
[18] C. Cortes and V. Vapnik, “Support-vector networks,” Mach. Learn., vol. 20, no. 3, pp. 273–297, Sep. 1995.
[19] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. New York, NY, USA: Cambridge University Press, 2008.
[20] D. M. Blei, A. Y. Ng, M. I. Jordan, and J. Lafferty, “Latent Dirichlet allocation,” Journal of Machine Learning Research, vol. 3, 2003.
[21] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York, NY, USA: Wiley-Interscience, 1991.
[22] M. Abadi et al., “TensorFlow: Large-scale machine learning on heterogeneous distributed systems,” 2015. [Online]. Available: http://download.tensorflow.org/paper/whitepaper2015.pdf
[23] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS 2010). Society for Artificial Intelligence and Statistics, 2010.
[24] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014. [Online]. Available: http://arxiv.org/abs/1412.6980
[25] A. Ng and M. Jordan, “On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes,” in Advances in Neural Information Processing Systems (NIPS), vol. 14, 2001.
[26] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, “LIBLINEAR: A library for large linear classification,” J. Mach. Learn. Res., vol. 9, pp. 1871–1874, Jun. 2008.
[27] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky, “The Stanford CoreNLP natural language processing toolkit,” in Association for Computational Linguistics (ACL) System Demonstrations, 2014, pp. 55–60.
[28] D. P. Putthividhya and J. Hu, “Bootstrapped named entity recognition for product attribute extraction,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, ser. EMNLP ’11. Stroudsburg, PA, USA: Association for Computational Linguistics, 2011, pp. 1557–1567.
[29] P. Das, R. Srihari, and Y. Fu, “Simultaneous joint and conditional modeling of documents tagged from two perspectives,” in Proceedings of the 20th ACM International Conference on Information and Knowledge Management, ser. CIKM ’11. New York, NY, USA: ACM, 2011, pp. 1353–1362.
[30] H.-F. Yu, C.-H. Ho, Y.-C. Juan, and C.-J. Lin, “LibShortText: A library for short-text classification and analysis,” Department of Computer Science, National Taiwan University, Taipei 106, Taiwan, Tech. Rep., 2013.
[31] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic minority over-sampling technique,” J. Artif. Int. Res., vol. 16, no. 1, pp. 321–357, Jun. 2002.