ADBIS 2007
Discretization Numbers for Multiple-Instances Problem in Relational Database
Rayner Alfred, Dimitar Kazakov
Artificial Intelligence Group, Computer Science Department,
University of York
30th September 2007, ADBIS 2007, Varna, Bulgaria
Overview
• Introduction
• Objectives
• Experimental Design
  – Data Pre-processing: Discretization
  – Data Summarization (DARA)
• Experimental Evaluation
• Experimental Results
• Conclusions
Introduction
• Handling numerical data stored in a relational database is unique
  – due to the multiple occurrences of an individual record in the non-target table, and
  – due to non-determinate relations between tables.
• Most traditional data mining methods deal with a single table, and the discretization process is based on a single table.
• In a relational database, multiple records from one table with numerical attributes are associated with a single structured individual stored in the target table.
• Numbers in multi-relational data mining (MRDM) are often discretized after considering the schema of the relational database.
Introduction
• This paper considers different alternatives for dealing with continuous attributes in MRDM.
• The discretization procedures considered in this paper include algorithms
  – that do not depend on the multi-relational structure, and
  – that are sensitive to this structure.
• A few discretization methods are implemented, including the proposed entropy-instance-based discretization, which is embedded in the DARA algorithm.
Objectives
• To study the effects of taking the one-to-many association issue into consideration in the process of discretizing continuous numbers.
  – Propose the entropy-instance-based discretization method, which is embedded in the DARA algorithm.
  – In the DARA algorithm, we employ several methods of discretization in conjunction with the C4.5 classifier as an induction algorithm.
  – We demonstrate on the empirical results obtained that discretization can be improved by taking the multiple-instance problem into consideration.
Experimental Design
• Data Pre-processing
  – Discretization of continuous attributes in a multi-relational setting using the Entropy-Instance-Based algorithm
• Data Aggregation
  – Data summarization using DARA, based on cluster dispersion and impurity
• Evaluation of the discretization methods using the C4.5 classifier
[Pipeline diagram: Relational Data → Discretization of Continuous Attributes using the Entropy-Instance-Based Algorithm → Categorical Data → Data Summarization using DARA based on Cluster Dispersion and Impurity → Summarized Data; learning can then be done using any traditional attribute-value (AV) data mining method.]
Data Pre-processing: Discretization
• To study the effects of the one-to-many association issue in the process of discretizing continuous numbers.
• Propose the entropy-instance-based discretization method, which is embedded in the DARA algorithm.
• In the DARA algorithm, we employ several methods of discretization in conjunction with the C4.5 classifier as an induction algorithm:
  – Equal Height – each bin has the same number of samples
  – Equal Weight – considers the distribution of the numeric values present and the groups they appear in
  – Entropy-Based – uses the class information entropy
  – Entropy-Instance-Based – uses the class information entropy and the individual information entropy
• We demonstrate that discretization can be improved by considering the one-to-many problem (a binning sketch follows).
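As a concrete reference point for the simplest of these baselines, here is a minimal sketch of equal-height (equal-frequency) binning in Python; the function names and the use of NumPy are ours, not from the paper.

```python
import numpy as np

def equal_height_bins(values, b):
    """Equal-height (equal-frequency) binning: choose b-1 cut points so that
    each of the b bins receives roughly the same number of samples."""
    values = np.sort(np.asarray(values, dtype=float))
    # Cut points sit at the 1/b, 2/b, ..., (b-1)/b quantiles of the data.
    return np.quantile(values, np.linspace(0.0, 1.0, b + 1)[1:-1])

def discretize(values, cut_points):
    """Map each continuous value to a bin index in 0..b-1."""
    return np.digitize(values, cut_points)

# Example: 3 bins over a small attribute column.
x = [1.5, 2.5, 5.5, 9.3, 12.6, 15.5, 20.5]
cuts = equal_height_bins(x, 3)
print(cuts, discretize(x, cuts))
```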
Entropy-Instance-Based (EIB) Discretization
• Background
  – Based on the entropy-based multi-interval discretization method (Fayyad and Irani, 1993).
  – Given a set of instances S, and two subsets S1 and S2 of S induced by a feature A and a partition boundary T, the class information entropy of the partition is

      E(A, T; S) = (|S1| / |S|) Ent(S1) + (|S2| / |S|) Ent(S2)

    where, for C classes,

      Ent(Sk) = − Σ_{i=1..C} p(Ci, Sk) log(p(Ci, Sk))

  – So, for k bins, the class information entropy for multi-interval entropy-based discretization is

      I(A, T, S, k) = Σ_{b=1..k} (|Sb| / |S|) Ent(Sb)
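As an orientation for the Fayyad–Irani criterion above, here is a minimal sketch that scans candidate boundaries T for one attribute and returns the one minimizing E(A, T; S); the data layout and function names are ours, not from the paper.

```python
import numpy as np

def class_entropy(labels):
    """Ent(S): Shannon entropy of the class labels in a subset."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_boundary(values, labels):
    """Scan candidate boundaries T and return the one minimizing E(A, T; S),
    the weighted class entropy of the two induced subsets S1 and S2."""
    values, labels = np.asarray(values), np.asarray(labels)
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    best_t, best_e = None, np.inf
    # Candidate boundaries: midpoints between consecutive distinct values.
    for i in range(1, len(values)):
        if values[i] == values[i - 1]:
            continue
        t = (values[i] + values[i - 1]) / 2
        left, right = labels[:i], labels[i:]
        e = (len(left) * class_entropy(left)
             + len(right) * class_entropy(right)) / len(labels)
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e

print(best_boundary([1.5, 2.5, 5.5, 9.3, 12.6], ['a', 'a', 'b', 'b', 'a']))
```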
Entropy-Instance-Based (EIB) Discretization
• In EIB, besides the class information entropy, another measure that uses individual information entropy is added to select multi-interval boundaries for discretization
• Given n individuals, the individual information entropy of a subset S is
    IndEnt(S) = − Σ_{i=1..n} p(Ii, S) log(p(Ii, S))

  where p(Ii, S) is the probability of the i-th individual in the subset S.
• The total individual information entropy over all k partitions is

    Ind(A, T, S, k) = Σ_{b=1..k} (|Sb| / |S|) IndEnt(Sb)
Entropy-Instance-Based (EIB) Discretization
• As a result, by minimizing the function Ind_I(A, T, S, k), which consists of the two sub-functions I(A, T, S, k) and Ind(A, T, S, k), we discretize the attribute's values based on both the class and the individual information entropy:

    Ind_I(A, T, S, k) = Σ_{b=1..k} (|Sb| / |S|) (Ent(Sb) + IndEnt(Sb))
                      = Σ_{b=1..k} (|Sb| / |S|) Ent(Sb) + Σ_{b=1..k} (|Sb| / |S|) IndEnt(Sb)
                      = I(A, T, S, k) + Ind(A, T, S, k)
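To make the combined criterion concrete, here is a minimal sketch of how Ind_I could be computed for a given set of cut points, assuming each value carries a class label and the ID of the individual (target record) it belongs to; the helper and variable names are ours, not from the paper.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label sequence (class labels or individual IDs)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def ind_i(values, classes, individuals, cut_points):
    """Ind_I(A, T, S, k): over the k bins induced by the cut points, the
    weighted sum of class entropy plus individual entropy of each bin."""
    values = np.asarray(values)
    classes = np.asarray(classes)
    individuals = np.asarray(individuals)
    bins = np.digitize(values, cut_points)     # bin index per value
    total = 0.0
    for b in np.unique(bins):
        mask = bins == b
        weight = mask.sum() / len(values)      # |S_b| / |S|
        total += weight * (entropy(classes[mask]) + entropy(individuals[mask]))
    return total

# Tiny example: values from a non-target table, each tagged with the class
# and the ID of the target-table individual it belongs to.
vals = [1.5, 2.5, 5.5, 9.3, 12.6, 15.5, 20.5]
cls  = ['a', 'a', 'b', 'b', 'a', 'b', 'b']
ids  = [1, 1, 2, 2, 3, 3, 3]
print(ind_i(vals, cls, ids, cut_points=[5.0, 12.0]))
```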
Entropy-Instance-Based (EIB) Discretization
• One of the main problems with this discretization criterion is that it is relatively expensive.
  – Use a GA-based discretization to obtain a multi-interval discretization for continuous attributes, consisting of
    • an initialization step, and
    • the iterative generations of the
      – reproduction phase,
      – crossover phase, and
      – mutation phase.
Entropy-Instance-Based (EIB) Discretization
• An initialization step
  – A set of strings (chromosomes), where each string consists of b−1 continuous values representing the b partitions, is randomly generated within the attribute's minimum and maximum values.
  – For instance, given minimum and maximum values of 1.5 and 20.5 for a continuous field, a chromosome could be (2.5, 5.5, 9.3, 12.6, 15.5, 20.5).
  – The fitness function for the genetic entropy-instance-based discretization is defined as

      f = 1 / Ind_I(A, T, S, k)
Entropy-Instance-Based (EIB) Discretization
• The iterative generations of
  – the reproduction phase
    • roulette wheel selection is used
  – the crossover phase
    • a crossover probability pc of 0.50 is used
  – the mutation phase
    • a fixed mutation probability pm of 0.10 is used
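Putting the last three slides together, here is a minimal sketch of such a GA loop under the stated settings (roulette wheel selection, pc = 0.50, pm = 0.10); it reuses the ind_i criterion sketched earlier, and every function and parameter name here is ours, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_chromosome(vmin, vmax, b):
    """A chromosome is b-1 sorted cut points drawn within [vmin, vmax]."""
    return np.sort(rng.uniform(vmin, vmax, size=b - 1))

def fitness(chrom, values, classes, individuals):
    """f = 1 / Ind_I; ind_i is the criterion sketched after the previous slide."""
    return 1.0 / (ind_i(values, classes, individuals, chrom) + 1e-12)

def roulette_select(population, fits):
    """Roulette wheel selection: draw parents proportionally to fitness."""
    p = fits / fits.sum()
    idx = rng.choice(len(population), size=len(population), p=p)
    return [population[i] for i in idx]

def crossover(p1, p2, pc=0.50):
    """Single-point crossover with probability pc; cut points are re-sorted."""
    if rng.random() < pc and len(p1) > 1:
        point = rng.integers(1, len(p1))
        p1, p2 = (np.sort(np.concatenate([p1[:point], p2[point:]])),
                  np.sort(np.concatenate([p2[:point], p1[point:]])))
    return p1, p2

def mutate(chrom, vmin, vmax, pm=0.10):
    """Each cut point is replaced by a fresh random value with probability pm."""
    chrom = chrom.copy()
    for i in range(len(chrom)):
        if rng.random() < pm:
            chrom[i] = rng.uniform(vmin, vmax)
    return np.sort(chrom)

def ga_discretize(values, classes, individuals, b, pop_size=20, generations=50):
    """Evolve a population of cut-point sets and return the fittest one."""
    vmin, vmax = float(np.min(values)), float(np.max(values))
    pop = [random_chromosome(vmin, vmax, b) for _ in range(pop_size)]
    for _ in range(generations):
        fits = np.array([fitness(c, values, classes, individuals) for c in pop])
        parents = roulette_select(pop, fits)
        pop = []
        for i in range(0, pop_size - 1, 2):
            c1, c2 = crossover(parents[i], parents[i + 1])
            pop += [mutate(c1, vmin, vmax), mutate(c2, vmin, vmax)]
    fits = np.array([fitness(c, values, classes, individuals) for c in pop])
    return pop[int(np.argmax(fits))]   # best set of b-1 cut points
```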
Data Summarization (DARA)
• Data summarization based on Information Retrieval (IR) theory
• Dynamic Aggregation of Relational Attributes (DARA) – categorizes objects with similar patterns based on TF-IDF weights borrowed from IR theory
• Scalable, and produces interpretable rules
[Diagram: a target table (T) linked to several non-target tables (NT); data summarization collapses the non-target tables into the target table. T = target table, NT = non-target table.]
Data Summarization (DARA)
• Data summarization based on Information Retrieval (IR) theory
• TF-IDF (term frequency–inverse document frequency) – a weight often used in information retrieval and text mining
• A statistical measure used to evaluate how important a word is to a document in a corpus
• The importance of a term increases proportionally to the number of times the word appears in the document, but is offset by the frequency of the word in the corpus.
Data Summarization (DARA)
• In a multi-relational setting,
  – an object (a single record) is considered as a document
  – all corresponding values of attributes stored in multiple tables are considered as terms that describe the characteristics of the object (the record)
  – DARA transforms the data representation from a relational model into a vector space model and employs the TF-IDF weighting scheme to cluster and summarize the records
Data Summarization (DARA)
• tfi · idfi (term frequency–inverse document frequency)
• The term frequency is

    tfi = ni / Σk nk

  where ni is the number of occurrences of the considered term, and the denominator is the number of occurrences of all terms.
• The inverse document frequency is a measure of the general importance of the term:

    idfi = log(|D| / d)

  with |D| the total number of documents in the corpus, and d the number of documents in which the term ti appears.
Data Summarization (DARA)
Data Summarization Stages
1. Information Propagation Stage
  – Propagates the record IDs and classes from the target concepts to the non-target tables
2. Data Aggregation Stage
  – Summarizes each record to become a single tuple
  – Uses a clustering technique based on the TF-IDF weights, in which each record can be represented as

      (tf1 log(n/df1), tf2 log(n/df2), . . . , tfm log(n/dfm))

  – The cosine similarity measure is used to compute the similarity between two records Ri and Rj:

      cos(Ri, Rj) = Ri · Rj / (||Ri|| ||Rj||)
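As an illustration of this representation, here is a minimal sketch that builds such TF-IDF vectors for a set of records (each record being the bag of categorical terms gathered from its non-target rows) and compares pairs of them with cosine similarity; the function names and toy data are ours, not from the paper.

```python
import numpy as np
from collections import Counter

def tfidf_vectors(records):
    """records: list of term lists, one per target record.
    Returns a term->column mapping and a matrix of tf * log(n/df) weights."""
    n = len(records)
    vocab = sorted({t for r in records for t in r})
    col = {t: j for j, t in enumerate(vocab)}
    df = Counter(t for r in records for t in set(r))   # document frequency
    X = np.zeros((n, len(vocab)))
    for i, r in enumerate(records):
        for t, tf in Counter(r).items():
            X[i, col[t]] = tf * np.log(n / df[t])
    return col, X

def cosine(u, v):
    """cos(Ri, Rj) = Ri·Rj / (||Ri|| ||Rj||)."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

# Toy example: three target records described by terms from non-target rows.
records = [["c_type1", "bond7", "bond7"],
           ["c_type1", "bond1"],
           ["o_type2", "bond1", "bond1"]]
_, X = tfidf_vectors(records)
print(cosine(X[0], X[1]), cosine(X[0], X[2]))
```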
Experimental Evaluation
• Implement the discretization methods in the DARA algorithm, in conjunction with the C4.5 classifier, as an induction algorithm that is run on DARA's discretized and transformed data representation.
• Chose three varieties of a well-known dataset, the Mutagenesis relational database.
  – The data describe 188 molecules falling into two classes, mutagenic (active) and non-mutagenic (inactive); 125 of these molecules are mutagenic.
Experimental Evaluation
• Three different sets of background knowledge are used (referred to as experiments B1, B2 and B3):
  – B1: the atoms in the molecule are given, as well as the bonds between them, the type of each bond, and the element and type of each atom.
  – B2: besides B1, the charges of the atoms are added.
  – B3: besides B2, the log of the compound's octanol/water partition coefficient (logP) and the energy of the compound's lowest unoccupied molecular orbital (εLUMO) are added.
• Perform a leave-one-out cross-validation using C4.5 for different numbers of bins, b, tested on B1, B2 and B3 (a sketch of the protocol follows).
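As a rough illustration of this evaluation protocol, here is a minimal leave-one-out cross-validation sketch; it uses scikit-learn's DecisionTreeClassifier as a stand-in for C4.5, and the data-loading helper is hypothetical, standing in for DARA's summarized and discretized output.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.tree import DecisionTreeClassifier

def loo_accuracy(X, y):
    """Leave-one-out cross-validation accuracy.
    X: summarized/discretized feature matrix (one row per molecule),
    y: class labels (mutagenic / non-mutagenic)."""
    hits = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        # DecisionTreeClassifier stands in for C4.5, which scikit-learn does not provide.
        clf = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
        hits += int(clf.predict(X[test_idx])[0] == y[test_idx][0])
    return hits / len(y)

# Usage (hypothetical helper standing in for the summarized Mutagenesis data):
# X, y = load_summarized_mutagenesis("B1", bins=4)
# print(loo_accuracy(X, y))
```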
Experimental Results
• Performance (%) of leave-one-out cross-validation of C4.5 on the Mutagenesis dataset
• The predictive accuracy for EqualHeight and EqualWeight is lower on datasets B1 and B2 when the number of bins is smaller.
• The accuracy of entropy-based and entropy-instance-based discretization is lower when the number of bins is smaller on dataset B3.
• The results of entropy-based and entropy-instance-based discretization on B1, B2 and B3 are virtually identical (in five out of nine tests EIB performs better than EB).
Conclusions
• Presented a method called Dynamic Aggregation of Relational Attributes (DARA) with entropy-instance-based discretization to propositionalise a multi-relational database.
• The DARA method has shown good performance on three variants of a well-known dataset in terms of predictive accuracy.
• The entropy-instance-based and entropy-based discretization methods are recommended for discretizing attribute values in multi-relational datasets.
  – Disadvantage: computation is expensive when the number of bins is large.
Thank You