Merging Taxonomies


Assertion

Creation and maintenance of large ontologies will require the capability to merge taxonomies

This problem is similar to the problem of merging e-commerce catalogs

R. Agrawal, R. Srikant: On Catalog Integration. WWW-10

Catalog Integration Problem

Integrate products from new catalog into master catalog.

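[Figure: Master Catalog with root ICs, subcategories Logic, Mem. and DSP, and products a, b, c, d, e, f; New Catalog with root ICs, subcategories Cat 1 and Cat 2, and products x, y, z.]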

The Problem (cont.)

After integration:

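[Figure: Master Catalog after integration, with root ICs, subcategories Logic, Mem. and DSP, and products a, b, c, d, e, f, x, y, z.]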

Desired Solution

Automatically integrate products:
  • Little or no effort on part of user.
  • Domain independent.

Problem size:
  • Million products
  • Thousands of nodes in the hierarchy

How do we do it

Build classification model (rules) using product descriptions in master catalog.
  • Example: If the product description contains "DRAM", the product is likely to be in the "Memory" category.

Use classification model to predict categories for products in the new catalog.

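[Figure: product x from the new catalog scored against master categories Logic and DSP, with predicted probabilities 5% and 95%.]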
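As a rough illustration of how a word such as "DRAM" can point to a category, the sketch below counts, for each word, the master-catalog categories whose descriptions contain it. The catalog entries and the helper names `word_category_counts` and `likely_category` are hypothetical, and this is a simplification, not the classifier used in the talk.

```python
from collections import Counter, defaultdict

# Hypothetical master-catalog entries: (product description, category).
master = [
    ("16Mb DRAM module", "Memory"),
    ("64Mb DRAM chip", "Memory"),
    ("quad NAND gate", "Logic"),
    ("fixed-point DSP core", "DSP"),
]

def word_category_counts(catalog):
    """Map each word to a Counter of the categories whose descriptions contain it."""
    counts = defaultdict(Counter)
    for description, category in catalog:
        # set() so a word repeated within one description is counted once.
        for word in set(description.lower().split()):
            counts[word][category] += 1
    return counts

def likely_category(word, counts):
    """Return the category most often associated with the word, or None if unseen."""
    by_category = counts.get(word.lower())
    if not by_category:
        return None
    return by_category.most_common(1)[0][0]

counts = word_category_counts(master)
print(likely_category("DRAM", counts))  # prints Memory
```

A single-word rule like this ignores the rest of the description; a probabilistic model combines the evidence from all words.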

National Semiconductor Files

Part: DS14185 EIA/TIA-232 3 Driver x 5 Receiver
Part_Id: DS14185
Manufacturer: national
Title: DS14185 EIA/TIA-232 3 Driver x 5 Receiver
Description: The DS14185 is a three driver, five receiver device which conforms to the EIA/TIA-232-E standard. The flow-through pinout facilitates simple non-crossover board layout. The DS14185 provides a one-chip solution for the common 9-pin serial RS-232 interface between data terminal and data communications equipment.

Part: LM3940 1A Low Dropout Regulator
...

Part: Wide Adjustable Range PNP Voltage Regulator
...

Part: LM2940/LM2940C 1A Low Dropout Regulator
...

National Semiconductor Files with Categories

Part: DS14185 EIA/TIA-232 3 Driver x 5 Receiver
Pangea Category:
  Choice 1: Transceiver
  Choice 2: Line Receiver
  Choice 3: Line Driver
  Choice 4: General-Purpose Silicon Rectifier
  Choice 5: Tapped Delay Line

Part: LM3940 1A Low Dropout Regulator
Pangea Category:
  Choice 1: Positive Fixed Voltage Regulator
  Choice 2: Voltage-Feedback Operational Amplifier
  Choice 3: Voltage Reference
  Choice 4: Voltage-Mode SMPS Controller
  Choice 5: Positive Adjustable Voltage Regulator

...

...

Accuracy on Pangea Data

B2B Portal for electronic components:
  • 1200 categories, 40K training documents.
  • 500 categories with < 5 documents.

Accuracy:
  • 72% for top choice.
  • 99.7% for top 5 choices.
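The "top choice" and "top 5 choices" figures are top-k accuracies: a prediction counts as correct when the true category appears among the first k ranked choices. A minimal sketch of that measure follows; the function name `top_k_accuracy`, the ranked lists and the true labels are hypothetical examples, not the Pangea data.

```python
def top_k_accuracy(ranked_predictions, true_labels, k):
    """Fraction of items whose true label is among the first k ranked guesses."""
    if len(ranked_predictions) != len(true_labels):
        raise ValueError("predictions and labels must have the same length")
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)

# Hypothetical ranked category choices for three parts, best guess first.
ranked = [
    ["Transceiver", "Line Receiver", "Line Driver"],
    ["Voltage Reference", "Positive Fixed Voltage Regulator", "Line Driver"],
    ["Line Driver", "Transceiver", "Voltage Reference"],
]
# Hypothetical true categories, made up for this example.
truth = ["Transceiver", "Positive Fixed Voltage Regulator", "Voltage Reference"]

for k in (1, 2, 3):
    print(k, top_k_accuracy(ranked, truth, k))
```

Top-5 accuracy matters when a person confirms the final category: the user picks from a short list instead of searching the whole hierarchy.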

Enhanced Algorithm

Use affinity information in catalog to be integrated:
  • Products in same category are similar.
  • Bias the classifier to incorporate this information.

Accuracy boost depends on quality of current catalog:
  • Use tuning set to determine amount of bias.
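One way to read "use tuning set to determine amount of bias" is a simple search: try several candidate weights, measure accuracy on a small labelled tuning set, and keep the best. The sketch below assumes a `classify(product, node, weight)` callable; that signature, the stand-in `toy_classify`, the candidate weights and the tuning data are all hypothetical.

```python
def accuracy(classify, tuning_set, weight):
    """Fraction of tuning products whose predicted category matches the label."""
    correct = sum(
        1
        for product, node, label in tuning_set
        if classify(product, node, weight) == label
    )
    return correct / len(tuning_set)

def choose_weight(classify, tuning_set, candidates=(1, 2, 5, 10, 25, 50, 100, 200)):
    """Return the candidate weight with the highest tuning-set accuracy."""
    # max() keeps the first best candidate, so the smallest weight wins ties.
    return max(candidates, key=lambda w: accuracy(classify, tuning_set, w))

# Hypothetical stand-in classifier: below weight 5 it trusts the text-only guess,
# from weight 5 upwards it follows the majority category of the product's node.
text_guess = {"p1": "Logic", "p2": "DSP", "p3": "Logic"}
node_majority = {"S1": "Memory"}

def toy_classify(product, node, weight):
    return node_majority[node] if weight >= 5 else text_guess[product]

# Hypothetical tuning set: (product, node in the new catalog, true category).
tuning_set = [("p1", "S1", "Memory"), ("p2", "S1", "Memory"), ("p3", "S1", "Logic")]

print(choose_weight(toy_classify, tuning_set))  # prints 5
```

In this toy data the node's grouping is mostly right, so a larger weight helps; with a poorly organised catalog the same search would settle on a small weight.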

Algorithm

Extension of the Naive-Bayes classification to incorporate affinity information

Empirical Results

[Figure: % errors of the Standard and Enhanced classifiers versus purity (no. of classes and their distribution: 71-22-6, 79-21, 100).]

Improvement in Accuracy (Pangea)

[Figure: accuracy (65 to 100) versus weight (1 to 200) for the series Perfect, 90-10, 80-20, GaussianA, GaussianB and Base.]

Improvement in Accuracy (Reuters)

[Figure: accuracy (82 to 100) versus weight (1 to 200) for the series Perfect, 90-10, 80-20, GaussianA, GaussianB and Base.]

Improvement in Accuracy (Google.Outdoors)

[Figure: accuracy (50 to 100) versus weight (1 to 1000) for the series Perfect, 90-10, 80-20, GaussianA, GaussianB and Base.]

Tune Set Size (Pangea)

[Figure: accuracy (70 to 95) versus tune set size (0 to 50) for the series Perfect, 90-10, 80-20, GaussianA, GaussianB and Base.]

Similar results for Reuters and Google.

Summary

The catalog integration technology can be directly used for creating and evolving large taxonomies

See WWW-2000 paper for experimental results on merging Yahoo and Google categorizations

Naive Bayes Classifier

Estimates the probability of a product belonging to a class:

$$\Pr(\text{class} \mid \text{product}) = \frac{\Pr(\text{class}) \cdot \Pr(\text{product} \mid \text{class})}{\Pr(\text{product})}$$

  • $\Pr(\text{class})$: number of products in class / total products
  • $\Pr(\text{product})$: same for all classes, $\sum_{\text{classes}} \Pr(\text{class}) \cdot \Pr(\text{product} \mid \text{class})$

How to compute $\Pr(\text{product} \mid \text{class})$?

Naive Bayes Classifier (cont.)

$$\Pr(\text{product} \mid \text{class}) = \Pr(\text{product description} \mid \text{class}) \cdot \Pr(\text{product attributes} \mid \text{class}) = \prod_{\text{words in description}} \Pr(\text{word} \mid \text{class}) \cdot \prod_{\text{attributes}} \Pr(A_i = v_k \mid \text{class})$$

Assumption: words and attributes occur independently.

$$\Pr(\text{word} \mid \text{class}) = \frac{n + \lambda}{t + \lambda \cdot |\text{Vocabulary}|}$$

  • $n$: number of times word occurs in class
  • $t$: total number of words in class
  • $\lambda$: smoothing parameter
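The word-based part of this model can be sketched in a few lines. The code below is a minimal illustration, assuming whitespace tokenisation, a smoothing parameter named `smoothing`, and no product attributes; the class name `NaiveBayes` and the training data are hypothetical. It works in log space so that products of many small probabilities do not underflow.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing               # smoothing term added to each word count
        self.class_counts = Counter()            # number of products per class
        self.word_counts = defaultdict(Counter)  # n: occurrences of each word per class
        self.total_words = Counter()             # t: total number of words per class
        self.vocabulary = set()

    def fit(self, products):
        """products: iterable of (description, class label) pairs."""
        for description, label in products:
            words = description.lower().split()
            self.class_counts[label] += 1
            self.word_counts[label].update(words)
            self.total_words[label] += len(words)
            self.vocabulary.update(words)
        return self

    def log_prior(self, label):
        # Pr(class) = products in class / total products
        return math.log(self.class_counts[label] / sum(self.class_counts.values()))

    def log_likelihood(self, description, label):
        # Sum over words of log((n + smoothing) / (t + smoothing * |Vocabulary|))
        lam, vocab_size = self.smoothing, len(self.vocabulary)
        return sum(
            math.log((self.word_counts[label][word] + lam)
                     / (self.total_words[label] + lam * vocab_size))
            for word in description.lower().split()
        )

    def rank(self, description):
        """Classes sorted from most to least probable; Pr(product) is omitted
        because it is the same for every class."""
        scores = {c: self.log_prior(c) + self.log_likelihood(description, c)
                  for c in self.class_counts}
        return sorted(scores, key=scores.get, reverse=True)

# Hypothetical master-catalog training data.
model = NaiveBayes().fit([
    ("16Mb DRAM module", "Memory"),
    ("64Mb DRAM chip", "Memory"),
    ("quad NAND gate", "Logic"),
    ("hex inverter gate", "Logic"),
])
print(model.rank("256Mb DRAM"))  # prints ['Memory', 'Logic']
```

Smoothing keeps an unseen word such as "256Mb" from driving a class probability to zero.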

Enhanced Classifier

$S$: node in new hierarchy

$$\Pr(\text{class} \mid \text{product}, S) = \frac{\Pr(\text{class} \mid S) \cdot \Pr(\text{product} \mid \text{class})}{\Pr(\text{product} \mid S)}$$

Ignore $\Pr(\text{product} \mid S)$.

$$\Pr(\text{class } C_i \mid S) = \frac{(|C_i| \cdot \text{number of products in } S \text{ predicted to be from } C_i)^w}{\sum_j (|C_j| \cdot \text{number of products in } S \text{ predicted to be from } C_j)^w}$$

$w$ determines the weight.
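To see how the weight w shapes the node-specific prior, the following sketch evaluates the expression for Pr(class | S) as written on this slide, with the exponent applied to the whole product. The function name `enhanced_prior`, the class sizes and the predictions are hypothetical; the fallback for a node with no predictions is an assumption of this sketch, not something stated in the slides.

```python
from collections import Counter

def enhanced_prior(class_sizes, predicted_in_node, weight):
    """Pr(class | S): (|C_i| * predictions for C_i in S) ** w, normalised over classes.

    class_sizes: {class: number of products in the master catalog}
    predicted_in_node: tentative class predictions for the products in node S
    """
    counts = Counter(predicted_in_node)
    raw = {c: (size * counts[c]) ** weight for c, size in class_sizes.items()}
    total = sum(raw.values())
    if total == 0:
        # Assumption: with no predictions in S, fall back to the standard prior.
        all_products = sum(class_sizes.values())
        return {c: size / all_products for c, size in class_sizes.items()}
    return {c: value / total for c, value in raw.items()}

# Hypothetical example: two equally sized classes, four products in node S.
sizes = {"Memory": 10, "Logic": 10}
predictions = ["Memory", "Memory", "Memory", "Logic"]

print(enhanced_prior(sizes, predictions, weight=1))
print(enhanced_prior(sizes, predictions, weight=2))
```

With weight 1 the prior follows the 3-to-1 split of the tentative predictions (0.75 against 0.25); with weight 2 it sharpens to 0.9 against 0.1. A class with no tentative predictions in S receives prior 0 for any positive weight.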

Algorithm Outline

For each node S in the member hierarchy:

  i. Tentatively classify each product p in S using the standard model.
  ii. Use the results of step i to compute Pr(class | S).
  iii. Re-classify each product in S using the enhanced model.
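The three steps can be sketched as a single loop over the nodes of the new catalog. This is a minimal illustration under stated assumptions: the standard model is supplied as a `log_likelihood(product, cls)` callable, the function name `integrate` and the likelihood table are hypothetical, and empty nodes are skipped.

```python
import math
from collections import Counter

def integrate(new_catalog, class_sizes, log_likelihood, weight):
    """new_catalog: {node: [product, ...]}; returns {product: predicted class}."""
    total = sum(class_sizes.values())
    assignment = {}
    for node, products in new_catalog.items():
        if not products:
            continue  # nothing to classify in an empty node
        # Step i: tentative classification with the standard prior |C| / total.
        tentative = [
            max(class_sizes,
                key=lambda c: math.log(class_sizes[c] / total) + log_likelihood(p, c))
            for p in products
        ]
        # Step ii: node-specific prior Pr(class | S) from the tentative predictions.
        counts = Counter(tentative)
        raw = {c: (class_sizes[c] * counts[c]) ** weight for c in class_sizes}
        norm = sum(raw.values())
        prior = {c: raw[c] / norm for c in class_sizes}
        # Step iii: re-classify every product in S with the node-specific prior.
        candidates = [c for c in class_sizes if prior[c] > 0]
        for p in products:
            assignment[p] = max(
                candidates,
                key=lambda c: math.log(prior[c]) + log_likelihood(p, c))
    return assignment

# Hypothetical Pr(product | class) values for products x, y, z of node "Cat 1".
likelihood = {
    ("x", "Memory"): 0.9, ("x", "Logic"): 0.1,
    ("y", "Memory"): 0.8, ("y", "Logic"): 0.2,
    ("z", "Memory"): 0.45, ("z", "Logic"): 0.55,
}
toy_log_likelihood = lambda p, c: math.log(likelihood[(p, c)])
sizes = {"Memory": 50, "Logic": 50}
catalog = {"Cat 1": ["x", "y", "z"]}

print(integrate(catalog, sizes, toy_log_likelihood, weight=2))
```

In this toy run, product z is a borderline case that the standard model places in Logic. Because x and y, which share its node, are confidently Memory, the enhanced model with weight 2 moves z to Memory as well; with weight 0 the node information is ignored and z stays in Logic.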