ADBIS 2007
Discretization Numbers for Multiple-Instances Problem in Relational Database
Rayner Alfred, Dimitar Kazakov
Artificial Intelligence Group, Computer Science Department,
University of York
30th September 2007, ADBIS 2007, Varna, Bulgaria
Overview
• Introduction
• Objectives
• Experimental Design
  – Data Pre-processing: Discretization
  – Data Summarization (DARA)
• Experimental Evaluation
• Experimental Results
• Conclusions
Introduction
• Handling numerical data stored in a relational database is unique
  – due to the multiple occurrences of an individual record in the non-target table, and
  – due to non-determinate relations between tables.
• Most traditional data mining methods deal with a single table, and the discretization process is based on a single table.
• In a relational database, multiple records from one table with numerical attributes are associated with a single structured individual stored in the target table.
• Numbers in multi-relational data mining (MRDM) are often discretized after considering the schema of the relational database.
Introduction
• This paper considers different alternatives for dealing with continuous attributes in MRDM.
• The discretization procedures considered in this paper include algorithms
  – that do not depend on the multi-relational structure, and
  – that are sensitive to this structure.
• A few discretization methods are implemented, including the proposed entropy-instance-based discretization, which is embedded in the DARA algorithm.
Objectives
• To study the effects of taking the one-to-many association issue into consideration in the process of discretizing continuous numbers.
  – Propose the entropy-instance-based discretization method, which is embedded in the DARA algorithm.
  – In the DARA algorithm, we employ several methods of discretization in conjunction with the C4.5 classifier as an induction algorithm.
  – We demonstrate on the empirical results obtained that discretization can be improved by taking the multiple-instance problem into consideration.
Experimental Design
• Data Pre-processing
  – Discretization of continuous attributes in a multi-relational setting using the Entropy-Instance-Based algorithm
• Data Aggregation
  – Data summarization using DARA, based on cluster dispersion and impurity
• Evaluation of the discretization methods using the C4.5 classifier
[Pipeline diagram: Relational Data → Discretization of Continuous Attributes using the Entropy-Instance-Based Algorithm → Categorical Data → Data Summarization using DARA based on Cluster Dispersion and Impurity → Summarized Data; learning can then be done using any traditional attribute-value (AV) data mining method.]
Data Pre-processing: Discretization
• To study the effects of the one-to-many association issue in the process of discretizing continuous numbers.
• Propose the entropy-instance-based discretization method, which is embedded in the DARA algorithm.
• In the DARA algorithm, we employ several methods of discretization in conjunction with the C4.5 classifier as an induction algorithm:
  – Equal Height – each bin has the same number of samples
  – Equal Weight – considers the distribution of the numeric values present and the groups they appear in
  – Entropy-Based – uses the class information entropy
  – Entropy-Instance-Based – uses the class information entropy and the individual information entropy
• We demonstrate that discretization can be improved by considering the one-to-many problem (a binning sketch follows).
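As a concrete reference point for the simplest of these baselines, here is a minimal sketch of equal-height (equal-frequency) binning in Python; the function names and the use of NumPy are ours, not from the paper.

```python
import numpy as np

def equal_height_bins(values, b):
    """Equal-height (equal-frequency) binning: choose b-1 cut points so that
    each of the b bins receives roughly the same number of samples."""
    values = np.sort(np.asarray(values, dtype=float))
    # Cut points sit at the 1/b, 2/b, ..., (b-1)/b quantiles of the data.
    return np.quantile(values, np.linspace(0.0, 1.0, b + 1)[1:-1])

def discretize(values, cut_points):
    """Map each continuous value to a bin index in 0..b-1."""
    return np.digitize(values, cut_points)

# Example: 3 bins over a small attribute column.
x = [1.5, 2.5, 5.5, 9.3, 12.6, 15.5, 20.5]
cuts = equal_height_bins(x, 3)
print(cuts, discretize(x, cuts))
```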
Entropy-Instance-Based (EIB) Discretization
• Background
  – Based on the entropy-based multi-interval discretization method (Fayyad and Irani, 1993).
  – Given a set of instances S, and two subsets S1 and S2 of S induced by a feature A and a partition boundary T, the class information entropy of the partition is

      E(A, T; S) = (|S1| / |S|) Ent(S1) + (|S2| / |S|) Ent(S2)

    where, for C classes,

      Ent(Sk) = − Σ_{i=1..C} p(Ci, Sk) log(p(Ci, Sk))

  – So, for k bins, the class information entropy for multi-interval entropy-based discretization is

      I(A, T, S, k) = Σ_{b=1..k} (|Sb| / |S|) Ent(Sb)
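As an orientation for the Fayyad–Irani criterion above, here is a minimal sketch that scans candidate boundaries T for one attribute and returns the one minimizing E(A, T; S); the data layout and function names are ours, not from the paper.

```python
import numpy as np

def class_entropy(labels):
    """Ent(S): Shannon entropy of the class labels in a subset."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_boundary(values, labels):
    """Scan candidate boundaries T and return the one minimizing E(A, T; S),
    the weighted class entropy of the two induced subsets S1 and S2."""
    values, labels = np.asarray(values), np.asarray(labels)
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    best_t, best_e = None, np.inf
    # Candidate boundaries: midpoints between consecutive distinct values.
    for i in range(1, len(values)):
        if values[i] == values[i - 1]:
            continue
        t = (values[i] + values[i - 1]) / 2
        left, right = labels[:i], labels[i:]
        e = (len(left) * class_entropy(left)
             + len(right) * class_entropy(right)) / len(labels)
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e

print(best_boundary([1.5, 2.5, 5.5, 9.3, 12.6], ['a', 'a', 'b', 'b', 'a']))
```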
Entropy-Instance-Based (EIB) Discretization
• In EIB, besides the class information entropy, another measure that uses individual information entropy is added to select multi-interval boundaries for discretization
• Given n individuals, the individual information entropy of a subset S is
    IndEnt(S) = − Σ_{i=1..n} p(Ii, S) log(p(Ii, S))

  where p(Ii, S) is the probability of the i-th individual in the subset S.
• The total individual information entropy over all k partitions is

    Ind(A, T, S, k) = Σ_{b=1..k} (|Sb| / |S|) IndEnt(Sb)
Entropy-Instance-Based (EIB) Discretization
• As a result, by minimizing the function Ind_I(A, T, S, k), which consists of the two sub-functions I(A, T, S, k) and Ind(A, T, S, k), we discretize the attribute's values based on both the class and the individual information entropy:

    Ind_I(A, T, S, k) = Σ_{b=1..k} (|Sb| / |S|) (Ent(Sb) + IndEnt(Sb))
                      = Σ_{b=1..k} (|Sb| / |S|) Ent(Sb) + Σ_{b=1..k} (|Sb| / |S|) IndEnt(Sb)
                      = I(A, T, S, k) + Ind(A, T, S, k)
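To make the combined criterion concrete, here is a minimal sketch of how Ind_I could be computed for a given set of cut points, assuming each value carries a class label and the ID of the individual (target record) it belongs to; the helper and variable names are ours, not from the paper.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label sequence (class labels or individual IDs)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def ind_i(values, classes, individuals, cut_points):
    """Ind_I(A, T, S, k): over the k bins induced by the cut points, the
    weighted sum of class entropy plus individual entropy of each bin."""
    values = np.asarray(values)
    classes = np.asarray(classes)
    individuals = np.asarray(individuals)
    bins = np.digitize(values, cut_points)     # bin index per value
    total = 0.0
    for b in np.unique(bins):
        mask = bins == b
        weight = mask.sum() / len(values)      # |S_b| / |S|
        total += weight * (entropy(classes[mask]) + entropy(individuals[mask]))
    return total

# Tiny example: values from a non-target table, each tagged with the class
# and the ID of the target-table individual it belongs to.
vals = [1.5, 2.5, 5.5, 9.3, 12.6, 15.5, 20.5]
cls  = ['a', 'a', 'b', 'b', 'a', 'b', 'b']
ids  = [1, 1, 2, 2, 3, 3, 3]
print(ind_i(vals, cls, ids, cut_points=[5.0, 12.0]))
```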
Entropy-Instance-Based (EIB) Discretization
• One of the main problems with this discretization criterion is that it is relatively expensive.
  – Use a GA-based discretization to obtain a multi-interval discretization for continuous attributes, consisting of
    • an initialization step, and
    • the iterative generations of the
      – reproduction phase,
      – crossover phase, and
      – mutation phase.
Entropy-Instance-Based (EIB) Discretization
• An initialization step
  – A set of strings (chromosomes), where each string consists of b−1 continuous values representing the b partitions, is randomly generated within the attribute's minimum and maximum values.
  – For instance, given minimum and maximum values of 1.5 and 20.5 for a continuous field, a chromosome could be (2.5, 5.5, 9.3, 12.6, 15.5, 20.5).
  – The fitness function for the genetic entropy-instance-based discretization is defined as

      f = 1 / Ind_I(A, T, S, k)
Entropy-Instance-Based (EIB) Discretization
• The iterative generations of
  – the reproduction phase
    • roulette wheel selection is used
  – the crossover phase
    • a crossover probability pc of 0.50 is used
  – the mutation phase
    • a fixed mutation probability pm of 0.10 is used
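Putting the last three slides together, here is a minimal sketch of such a GA loop under the stated settings (roulette wheel selection, pc = 0.50, pm = 0.10); it reuses the ind_i criterion sketched earlier, and every function and parameter name here is ours, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_chromosome(vmin, vmax, b):
    """A chromosome is b-1 sorted cut points drawn within [vmin, vmax]."""
    return np.sort(rng.uniform(vmin, vmax, size=b - 1))

def fitness(chrom, values, classes, individuals):
    """f = 1 / Ind_I; ind_i is the criterion sketched after the previous slide."""
    return 1.0 / (ind_i(values, classes, individuals, chrom) + 1e-12)

def roulette_select(population, fits):
    """Roulette wheel selection: draw parents proportionally to fitness."""
    p = fits / fits.sum()
    idx = rng.choice(len(population), size=len(population), p=p)
    return [population[i] for i in idx]

def crossover(p1, p2, pc=0.50):
    """Single-point crossover with probability pc; cut points are re-sorted."""
    if rng.random() < pc and len(p1) > 1:
        point = rng.integers(1, len(p1))
        p1, p2 = (np.sort(np.concatenate([p1[:point], p2[point:]])),
                  np.sort(np.concatenate([p2[:point], p1[point:]])))
    return p1, p2

def mutate(chrom, vmin, vmax, pm=0.10):
    """Each cut point is replaced by a fresh random value with probability pm."""
    chrom = chrom.copy()
    for i in range(len(chrom)):
        if rng.random() < pm:
            chrom[i] = rng.uniform(vmin, vmax)
    return np.sort(chrom)

def ga_discretize(values, classes, individuals, b, pop_size=20, generations=50):
    """Evolve a population of cut-point sets and return the fittest one."""
    vmin, vmax = float(np.min(values)), float(np.max(values))
    pop = [random_chromosome(vmin, vmax, b) for _ in range(pop_size)]
    for _ in range(generations):
        fits = np.array([fitness(c, values, classes, individuals) for c in pop])
        parents = roulette_select(pop, fits)
        pop = []
        for i in range(0, pop_size - 1, 2):
            c1, c2 = crossover(parents[i], parents[i + 1])
            pop += [mutate(c1, vmin, vmax), mutate(c2, vmin, vmax)]
    fits = np.array([fitness(c, values, classes, individuals) for c in pop])
    return pop[int(np.argmax(fits))]   # best set of b-1 cut points
```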
Data Summarization (DARA)
• Data summarization based on Information Retrieval (IR) theory
• Dynamic Aggregation of Relational Attributes (DARA) – categorizes objects with similar patterns based on TF-IDF weights borrowed from IR theory
• Scalable, and produces interpretable rules
[Diagram: a target table (T) linked to several non-target tables (NT); data summarization collapses the non-target tables into the target table. T = target table, NT = non-target table.]
Data Summarization (DARA)
• Data summarization based on Information Retrieval (IR) theory
• TF-IDF (term frequency–inverse document frequency) – a weight often used in information retrieval and text mining
• A statistical measure used to evaluate how important a word is to a document in a corpus
• The importance of a term increases proportionally to the number of times the word appears in the document, but is offset by the frequency of the word in the corpus.
Data Summarization (DARA)
• In a multi-relational setting,
  – an object (a single record) is considered as a document
  – all corresponding values of attributes stored in multiple tables are considered as terms that describe the characteristics of the object (the record)
  – DARA transforms the data representation from a relational model into a vector space model and employs the TF-IDF weighting scheme to cluster and summarize the records
Data Summarization (DARA)
• tfi · idfi (term frequency–inverse document frequency)
• The term frequency is

    tfi = ni / Σk nk

  where ni is the number of occurrences of the considered term, and the denominator is the number of occurrences of all terms.
• The inverse document frequency is a measure of the general importance of the term:

    idfi = log(|D| / d)

  with |D| the total number of documents in the corpus, and d the number of documents in which the term ti appears.
Data Summarization (DARA)
Data Summarization Stages
1. Information Propagation Stage
  – Propagates the record IDs and classes from the target concepts to the non-target tables
2. Data Aggregation Stage
  – Summarizes each record to become a single tuple
  – Uses a clustering technique based on the TF-IDF weights, in which each record can be represented as

      (tf1 log(n/df1), tf2 log(n/df2), . . . , tfm log(n/dfm))

  – The cosine similarity measure is used to compute the similarity between two records Ri and Rj:

      cos(Ri, Rj) = Ri · Rj / (||Ri|| ||Rj||)
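As an illustration of this representation, here is a minimal sketch that builds such TF-IDF vectors for a set of records (each record being the bag of categorical terms gathered from its non-target rows) and compares pairs of them with cosine similarity; the function names and toy data are ours, not from the paper.

```python
import numpy as np
from collections import Counter

def tfidf_vectors(records):
    """records: list of term lists, one per target record.
    Returns a term->column mapping and a matrix of tf * log(n/df) weights."""
    n = len(records)
    vocab = sorted({t for r in records for t in r})
    col = {t: j for j, t in enumerate(vocab)}
    df = Counter(t for r in records for t in set(r))   # document frequency
    X = np.zeros((n, len(vocab)))
    for i, r in enumerate(records):
        for t, tf in Counter(r).items():
            X[i, col[t]] = tf * np.log(n / df[t])
    return col, X

def cosine(u, v):
    """cos(Ri, Rj) = Ri·Rj / (||Ri|| ||Rj||)."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

# Toy example: three target records described by terms from non-target rows.
records = [["c_type1", "bond7", "bond7"],
           ["c_type1", "bond1"],
           ["o_type2", "bond1", "bond1"]]
_, X = tfidf_vectors(records)
print(cosine(X[0], X[1]), cosine(X[0], X[2]))
```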
Experimental Evaluation
• Implement the discretization methods in the DARA algorithm, in conjunction with the C4.5 classifier, as an induction algorithm that is run on DARA's discretized and transformed data representation.
• Chose three varieties of a well-known dataset, the Mutagenesis relational database.
  – The data describe 188 molecules falling into two classes, mutagenic (active) and non-mutagenic (inactive); 125 of these molecules are mutagenic.
Experimental Evaluation
• Three different sets of background knowledge are used (referred to as experiments B1, B2 and B3):
  – B1: the atoms in the molecule are given, as well as the bonds between them, the type of each bond, and the element and type of each atom.
  – B2: besides B1, the charges of the atoms are added.
  – B3: besides B2, the log of the compound's octanol/water partition coefficient (logP) and the energy of the compound's lowest unoccupied molecular orbital (εLUMO) are added.
• Perform a leave-one-out cross-validation using C4.5 for different numbers of bins, b, tested on B1, B2 and B3 (a sketch of the protocol follows).
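As a rough illustration of this evaluation protocol, here is a minimal leave-one-out cross-validation sketch; it uses scikit-learn's DecisionTreeClassifier as a stand-in for C4.5, and the data-loading helper is hypothetical, standing in for DARA's summarized and discretized output.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.tree import DecisionTreeClassifier

def loo_accuracy(X, y):
    """Leave-one-out cross-validation accuracy.
    X: summarized/discretized feature matrix (one row per molecule),
    y: class labels (mutagenic / non-mutagenic)."""
    hits = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        # DecisionTreeClassifier stands in for C4.5, which scikit-learn does not provide.
        clf = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
        hits += int(clf.predict(X[test_idx])[0] == y[test_idx][0])
    return hits / len(y)

# Usage (hypothetical helper standing in for the summarized Mutagenesis data):
# X, y = load_summarized_mutagenesis("B1", bins=4)
# print(loo_accuracy(X, y))
```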
Experimental Results
• Performance (%) of leave-one-out cross-validation of C4.5 on the Mutagenesis dataset
• The predictive accuracy for EqualHeight and EqualWeight is lower on datasets B1 and B2 when the number of bins is smaller.
• The accuracy of entropy-based and entropy-instance-based discretization is lower when the number of bins is smaller on dataset B3.
• The results of entropy-based and entropy-instance-based discretization on B1, B2 and B3 are virtually identical (in five out of nine tests EIB performs better than EB).
Conclusions
• Presented a method called Dynamic Aggregation of Relational Attributes (DARA) with entropy-instance-based discretization to propositionalise a multi-relational database.
• The DARA method has shown good performance on three variants of a well-known dataset in terms of predictive accuracy.
• The entropy-instance-based and entropy-based discretization methods are recommended for discretizing attribute values in multi-relational datasets.
  – Disadvantage: computation is expensive when the number of bins is large.
Thank You