
Lecture outline:

• Data integration and transformation
  o How to change the data from one form to another
  o Understand the importance of correlation analysis
  o Need for integration of data

• Data Reduction
• Data Discretization
• Concept hierarchy generation

Data Integration:

In data integration, we combine data from multiple sources into a coherent store. This coherent store is called a Data Warehouse. The sources include databases, data cubes, flat files, etc.

While integrating the data, we also integrate the metadata of the sources, which we call schema integration.

For example, the same attribute can exist under different names across two different sources, such as:

A.cust-id=B.cust_number

Here, cust-id is used to identify the customers in source A and cust_number is used to identify the customers in source B. In such a situation, without understanding the schema representation of the two sources, it is very difficult to integrate them. That is why schema integration is very important here.

Metadata is widely used to resolve errors while integrating the data, and it also helps to transform the data. In fact, not only data warehousing but also data mining requires data integration, for example to find frequent patterns in the large available data. When we integrate data from multiple sources, the sources are not necessarily of the same type; they may be quite different. Therefore, we need to combine them under a single schema.

So there are several issues; as mentioned, schema integration is one of them. We need to identify a strategy for combining the sources.

The problem of entity identification is essential while integrating the sources; in fact, it is the strategy used to combine the data. We would like to use name equivalence to combine the sources, but how can equivalent real-world entities from different sources be matched? This matching is done through entity identification. For example, customer-id in one database and cust_number in another database refer to the same entity. How exactly does the computer or the data analyst compare them? As said earlier, to solve such problems we use the metadata. The metadata keeps the details of each attribute, such as the attribute name, the meaning of the attribute, its data type, its range of values, and the rules for handling null values. Such metadata helps to avoid errors while integrating the schemas.

Redundancy control

During integration, we should also ensure that data is not duplicated. If the data is duplicated it leads to redundancy, which can reduce efficiency and lead to inconsistency. Therefore, redundancy also needs to be controlled. When the attribute values are different for the same real-world entity, how do we decide whether the data is redundant or not? Detecting and resolving data value conflicts is thus another important issue to be taken into account while integrating the data, because the same real-world entity may have different attribute values in different sources. The reason for such differences may be the representation used in each entity definition. For example, units may be represented in metric units in one source and in British units in another source, while both represent the same kind of data values.

Redundancy handling

Redundant data often occurs during the integration of multiple databases. Redundancy is an important issue that needs to be taken into account while integrating the data. The situations which lead to redundancy are:

• Inconsistency: Inconsistencies in attribute or dimension naming cause redundancy in data sets.

• Derivable data: If an attribute can be derived from other attributes, then the derived attribute is redundant. For example, date of birth and age are two attributes present in the database; age can be derived from the date of birth, therefore we can consider age a redundant attribute.

• Object identification: The same attribute or object may have different names in different databases.

Now the question is: how do we detect whether an attribute is redundant or not?

To detect such redundancy we generally use correlation analysis. That is to say, correlation analysis is useful in detecting possible redundancy. Careful integration of data from multiple sources helps in reducing or avoiding redundancies and inconsistencies, and improves the speed and quality of mining, i.e. the performance of mining improves.

Correlation Analysis (Numerical data)

Correlation analysis helps to determine how strongly one attribute implies the other attribute.

Correlation coefficient (also called Pearson's product moment coefficient):

r_{A,B} = Σ (a_i − Ā)(b_i − B̄) / (n σ_A σ_B)

where Ā and B̄ are the means of A and B, σ_A and σ_B are their standard deviations, and n is the number of tuples.

The correlation coefficient always lies between -1 and +1. When we compute the correlation coefficient using this formula, the coefficient value tells us what kind of relationship exists between one set of values A and another set of values B.

The higher the coefficient value, the stronger the correlation, meaning the more each attribute implies the other; a high correlation value may indicate that attribute A (or B) can be removed as redundant. If the correlation coefficient is greater than 0, A and B are positively correlated: the values of A increase as the values of B increase. That is why we say the higher the value, the stronger the correlation. If the correlation coefficient is equal to 0, then attributes A and B are independent, meaning they are not correlated. And if the correlation coefficient is negative, then A and B are negatively correlated, which means each attribute discourages the other. We can also use scatter plots to view the correlation between the attributes.
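As an illustration, here is a minimal sketch of computing the Pearson coefficient for two numeric attributes; the attribute values (income, spending) are made up for the example.

```python
import numpy as np

def pearson_correlation(a, b):
    """Pearson product moment coefficient between two numeric attributes."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n = len(a)
    # r = sum((a_i - mean_A)(b_i - mean_B)) / (n * sigma_A * sigma_B)
    return ((a - a.mean()) * (b - b.mean())).sum() / (n * a.std() * b.std())

# Hypothetical attribute values gathered from two sources
income   = [12, 15, 19, 24, 30, 33]
spending = [ 6,  7, 10, 13, 15, 18]

r = pearson_correlation(income, spending)
print(round(r, 3))                          # close to +1: strongly positively correlated
print(np.corrcoef(income, spending)[0, 1])  # cross-check with numpy's built-in
```

A value close to +1 here would suggest that one of the two attributes is a candidate for removal as redundant.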

Now, for categorical data we can use the chi-square test to determine the correlation. For categorical (that is, discrete) data, a correlation relationship between two attributes A and B can be discovered by a chi-square (χ²) test. Suppose attribute A has c distinct values a1, a2, ..., ac and B has r distinct values b1, b2, ..., br, with the c values of A making up the columns and the r values of B making up the rows of a contingency table. The chi-square statistic is computed from the counts:

χ² = Σ_i Σ_j (o_ij − e_ij)² / e_ij

where o_ij is the observed count of the joint event (A = a_i, B = b_j) and e_ij is the expected count,

e_ij = count(A = a_i) × count(B = b_j) / n

i.e. the count of tuples with A equal to a_i multiplied by the count of tuples with B equal to b_j, divided by the total number of tuples n.

The larger the χ² value, the more likely it is that the variables are related. The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count.

Correlation does not imply causality. If A and B are correlated, that does not necessarily mean that A causes B or B causes A. While analysing a demographic database, for example, we may find that the attributes representing the number of hospitals and the number of car thefts in a region are correlated, say with a correlation coefficient close to 1. This does not mean that the number of hospitals causes the number of car thefts, or that the number of car thefts causes the number of hospitals. In fact, both are linked to a third attribute, the population: the larger the population, the more hospitals there are and the more car thefts may happen.

For example, by applying this chi-square formula to a group of people surveyed about whether they like science fiction and whether they play chess, we can conclude that "like science fiction" and "play chess" are correlated in the group, as the calculated χ² value (say around 507) is far greater than the critical value needed to accept independence. The χ² test starts from the hypothesis that A and B are independent, and is based on a chosen significance level with (r-1) x (c-1) degrees of freedom; because the computed statistic exceeds that threshold, we reject independence and conclude that the two attributes are strongly correlated for the given group. So, for categorical data, we use the chi-square test.
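Below is a minimal sketch of the χ² computation on a 2x2 contingency table. The counts are illustrative assumptions, not necessarily the ones on the slide; with these particular numbers the statistic comes out near 508, consistent with the value of about 507 quoted above.

```python
import numpy as np

def chi_square(observed):
    """Chi-square statistic for an r x c contingency table of observed counts."""
    observed = np.asarray(observed, dtype=float)
    n = observed.sum()
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    expected = row_totals * col_totals / n        # e_ij = count(A=a_i) * count(B=b_j) / n
    return ((observed - expected) ** 2 / expected).sum()

# Assumed counts: rows = play chess / do not play chess,
# columns = like science fiction / do not like science fiction
table = [[250, 200],
         [ 50, 1000]]

chi2 = chi_square(table)
dof = (len(table) - 1) * (len(table[0]) - 1)      # (r-1) x (c-1) degrees of freedom
print(round(chi2, 2), "with", dof, "degree(s) of freedom")
```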

Data Transformation

Data are transformed and consolidated into forms that are appropriate for mining. There are different kinds of transformation methods, some of which were already discussed earlier. These methods are:

1. Smoothing: It is used to remove noise from the data. Binning, regression, and clustering belong to this category of smoothing methods.

2. Aggregation: Summary or aggregation operations are applied to the data, as in data cube computation; such summarization helps to reduce the data.

3. Generalization: By replacing low-level attribute values with higher-level values, as per the concept hierarchy, we can generalize the data. This also belongs to the category of transformation.

4. Normalization: This is basically used to replace one set of values in a range with another set of values in a new range; it reduces the number of possible values in each range. The popular methods used for normalization are:

   1. Min-max normalization
   2. Z-score normalization
   3. Normalization by decimal scaling

We now discuss each one of these methods with an example. [slide16]

1. Min-max normalization: Min-max normalization is a linear transformation. The original data in the given range are replaced with a new set of values. Suppose minA and maxA are the minimum and maximum values of the attribute A; then min-max normalization maps a value V of A to the value V' in the new range [new_minA, new_maxA] using the formula

V' = ((V − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA

For example, suppose income ranges between 12,000 and 98,000 and is normalized to the interval [0, 1]. We know that minA is 12,000, maxA is 98,000, new_maxA is 1 and new_minA is 0, and the value V = 73,000 is mapped to a value V' in the range 0 to 1 by using the above formula. If we substitute all the values into the formula, we get V' ≈ 0.71. That means the value 73,000, which belongs to the range 12,000 to 98,000, is replaced by about 0.71 in the new interval 0 to 1. So this transformation definitely reduces the number of possible values from a given range to a new range. [slide17]
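A minimal sketch of min-max normalization under the figures used above (minA = 12,000, maxA = 98,000, new range [0, 1]):

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Linearly map v from [min_a, max_a] into [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# Income example from the text: 12,000..98,000 mapped to [0, 1]
print(round(min_max_normalize(73_000, 12_000, 98_000), 3))   # ~0.709
```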

2. Z-score normalization: This is also termed zero-mean normalization because we take the deviation about the mean. The mean is represented by µ and the standard deviation by sigma (σ), and a value V of attribute A is mapped to

V' = (V − µ) / σ

Suppose µ = 54,000 and σ = 16,000; then a given value V = 73,600 is replaced by V' = 1.225. That means the values of the attribute are normalized based on the mean and standard deviation of attribute A. [slide18]

3. Normalization by decimal scaling: Here, every value V is divided by 10^j, where j is the smallest integer such that max(|V'|) < 1, i.e.

V' = V / 10^j

So we replace every value V with V' by taking the ratio of V to 10^j. Using these data transformation methods we can transform the data from one form to another, into a new interval, and this definitely reduces the number of distinct values for each attribute.
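A minimal sketch of the other two normalization methods. The z-score figures are the µ = 54,000 and σ = 16,000 from the example above; the decimal-scaling input values are made up.

```python
import numpy as np

def z_score_normalize(v, mean, std):
    """Zero-mean normalization: deviation about the mean, in units of sigma."""
    return (v - mean) / std

def decimal_scaling_normalize(values):
    """Divide every value by 10**j, with j the smallest integer so max(|v'|) < 1."""
    values = np.asarray(values, dtype=float)          # assumes at least one non-zero value
    j = int(np.floor(np.log10(np.abs(values).max()))) + 1
    return values / 10 ** j, j

print(z_score_normalize(73_600, mean=54_000, std=16_000))   # 1.225, as in the text

scaled, j = decimal_scaling_normalize([-986, 217, 45])       # made-up values
print(scaled, "using j =", j)                                # max |v'| < 1 with j = 3
```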

In this topic we also discussed how to integrate the data and how to resolve the schema integration problem. Both of these ultimately help us detect inconsistencies among the data values and identify what kinds of redundancy exist among them. In this way, correlation analysis for numerical as well as categorical data is helpful in detecting redundancy.

Data reduction

Data reduction produces a reduced representation of the data set that is much smaller in volume but yet produces the same, or almost the same, analytical results. A database or a data warehouse may consist of terabytes of data. In order to make complex data analysis practical or feasible, there is a need to reduce this huge amount of data to a form that is ready to mine, using data reduction methods. These reduction methods produce a reduced representation of the data which is smaller in volume, closely maintains the integrity of the original data, and ultimately produces approximately equal analytical results.

The various strategies that exist in the literature for data reduction are data cube aggregation, dimensionality reduction, data compression, numerosity reduction, and discretization and concept hierarchy generation.

Data cube aggregation

Aggregation operations are applied to the data in the construction of a data cube. The cube stores multidimensional information, and the presence of concept hierarchies helps to compute aggregations at multiple levels. Data cubes provide fast access to pre-computed, summarised data, thereby benefiting OLAP and data mining. The lowest level of a data cube is called the base cuboid and the highest level is called the apex cuboid. In each data cube we can have multiple levels of aggregation, to reduce the size of the data further. For example, consider customer phone-call data in a data warehouse: we can build a data cube keyed by the individual entities of interest and, by applying aggregation, obtain different kinds of summaries from the customer phone-call data.

In this approach, aggregation operations are applied while building the data cube. Data cubes hold multidimensional aggregated information, that is, they store different aggregations of the information. Each cell in a data cube holds an aggregated data value corresponding to a data point in the multidimensional space. The cube also allows analysis at multiple levels; to do that, we use a concept hierarchy for each attribute. These cubes provide faster access to the pre-computed, summarized data. Data cubes created at various levels of abstraction are termed cuboids, so a data cube can be viewed as a lattice of cuboids. Each higher level of abstraction further reduces the resulting data size. We can also run queries on the aggregated information in the data cube whenever required.
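As a rough illustration of multi-level aggregation in the spirit of a data cube, here is a sketch using pandas group-by operations; the sales table and its columns (year, quarter, branch, amount) are hypothetical.

```python
import pandas as pd

# Hypothetical base data (the "base cuboid" level of detail)
sales = pd.DataFrame({
    "year":    [2010, 2010, 2010, 2011, 2011, 2011],
    "quarter": ["Q1", "Q2", "Q1", "Q1", "Q2", "Q2"],
    "branch":  ["A",  "A",  "B",  "A",  "B",  "B"],
    "amount":  [200,  150,  300,  250,  100,  400],
})

# Aggregate upward along the concept hierarchy: quarter -> year -> all
by_branch_quarter = sales.groupby(["branch", "year", "quarter"])["amount"].sum()
by_branch_year    = sales.groupby(["branch", "year"])["amount"].sum()
apex_total        = sales["amount"].sum()   # the apex cuboid: a single grand total

print(by_branch_quarter, by_branch_year, apex_total, sep="\n\n")
```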

Attribute subset selection

Attribute subset selection is a strategy that basically helps to identify automatically the most important and essential attributes out of the many attributes of a given data set. This process eliminates irrelevant or redundant attributes from the database. The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes. The advantage of this process is that it decreases the number of discovered patterns and makes the patterns easier to understand. Since an exponential number of subsets (2^n) exist for n attributes, heuristic approaches are used to determine the best subset of attributes. The different kinds of heuristic methods for solving the attribute subset selection problem are:

• Stepwise forward selection
• Stepwise backward elimination
• Combination of forward selection and backward elimination
• Decision tree induction

These methods are useful for determining the optimal number of attributes, which we call features. In stepwise forward selection, we start with an empty set and add the best attributes one after the other; that is why it is called forward selection (a small sketch of this procedure is given after this discussion). In stepwise backward elimination, we take all the attributes initially and then recursively eliminate the least informative attributes from this set; that is why it is called backward elimination. If we use forward selection and backward elimination together, we call it the combined forward selection and backward elimination method.

As an example, the slide shows a simple decision tree. Suppose we have six attributes a1, a2, ..., a6. To identify the features that are essential for determining the class label, say class 1 or class 2, this tree structure needs a minimum of three attributes; so out of the six attributes only three are sufficient to determine the class. First we divide the data by the value of attribute a4, giving two groups; in one group attribute a1 is then used to determine the class label, whereas in the other group attribute a6 is used. Therefore, if we collect all the attributes used in this tree, we get the set of three attributes {a1, a4, a6}, which is a reduced representation of the initial attribute set. In stepwise (best) feature selection, as mentioned earlier, the best single feature is picked first, then the next best single feature, and so on; in stepwise feature elimination, the worst feature is eliminated first, then the next worst, and so on. We can also use backtracking and branch-and-bound techniques to search for an optimal feature subset.

Another kind of strategy for data reduction is data compression. Extensive theory and well-tuned algorithms exist for compressing strings; such compression is typically lossless, but only limited manipulation is possible without expanding the compressed data. Audio and video compression, on the other hand, are typically lossy, with progressive refinement; sometimes small fragments of the signal can be reconstructed without reconstructing the whole. A time sequence is not audio: it is typically short and varies slowly with time. As the figure shows, the original data is compressed and stored in a database, which we call a compressed database. If we can reconstruct the original data completely from the compressed data, we call the compression lossless; if decompression gives only an approximation of the original data, we call it lossy compression.
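Returning to stepwise forward selection described above, here is a minimal sketch of it as a greedy wrapper. The leave-one-out 1-nearest-neighbour accuracy used as the scoring criterion is only an assumption for the example, not the criterion used in the lecture.

```python
import numpy as np

def loo_1nn_accuracy(X, y):
    """Leave-one-out 1-nearest-neighbour accuracy, used here as a stand-in scorer."""
    n = len(y)
    correct = 0
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                      # exclude the sample itself
        correct += y[np.argmin(d)] == y[i]
    return correct / n

def forward_selection(X, y, max_features=None):
    """Greedy stepwise forward selection: start empty, add the best attribute each round."""
    remaining = list(range(X.shape[1]))
    selected, best_score = [], 0.0
    max_features = max_features or X.shape[1]
    while remaining and len(selected) < max_features:
        scores = [(loo_1nn_accuracy(X[:, selected + [j]], y), j) for j in remaining]
        score, j = max(scores)
        if score <= best_score:            # stop when no attribute improves the score
            break
        selected.append(j)
        remaining.remove(j)
        best_score = score
    return selected, best_score

# Tiny synthetic data set: attribute 0 carries the class signal, attributes 1-2 are noise
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=60)
X = np.column_stack([y + rng.normal(0, 0.3, 60), rng.normal(size=60), rng.normal(size=60)])

print(forward_selection(X, y))             # usually selects just attribute 0
```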

Dimensionality reduction

In dimensionality reduction, we produce a reduced version of the original data by applying data encoding or transformation methods. The reduction produced this way may be lossy or lossless: if the original data can be reconstructed from the reduced data without any loss of information, as mentioned earlier, it is termed lossless; otherwise it is lossy. Several methods exist in the literature, but we concentrate on two popular methods of lossy dimensionality reduction, namely the wavelet transform and principal component analysis.

The wavelet transform is a signal processing method. It takes a data set, viewed as a data vector X, as input and produces a numerically different vector X' of wavelet coefficients as output. Both vectors X and X' are of the same length; here X is viewed as an n-dimensional data vector. A reduced representation is obtained by truncating the wavelet-transformed data: given a user-defined threshold, all wavelet coefficients larger than the threshold are retained and all coefficients smaller than the threshold are set to 0. The resulting data representation is therefore very sparse, and operations on sparse data can be performed very fast compared with non-sparse data. The method is also suitable for removing noise without smoothing out the main features of the data. Using the inverse transform, an approximation of the original data can be constructed from the retained set of coefficients. The approach can also be used for multidimensional data such as data cubes: we first transform the first dimension, then the second dimension, and so on. The computational complexity involved is linear with respect to the number of cells in the cube. Wavelet transforms give good results on sparse data, skewed data, and data with ordered attributes. The method works like this: the length L of the data vector must be an integer power of 2 (we can pad with 0s if required); each transform applies two functions, a smoothing function and a difference function, to pairs of the data, resulting in two sets of data of length L/2; and the two functions are applied recursively until the desired length is reached.

Another method is principal component analysis (PCA). This method looks for k n-dimensional orthogonal vectors that can best be used to represent the data, where k is less than the number of data attributes. That is, we find the best k orthogonal vectors, which we term principal components, that can be used to represent the data. This way the original data are projected onto a much smaller space, which results in dimensionality reduction. The method combines the essence of the attributes by creating an alternative, smaller set of variables. Principal component analysis is computationally inexpensive, can be applied to ordered as well as unordered attributes, and can handle sparse and skewed data. Principal components can also be used as inputs to regression and clustering methods. PCA tends to be the better approach for sparse data, whereas wavelet transforms are more suitable for data of high dimensionality. The steps for determining the principal components are shown on this slide.

First we normalize the input data, ensuring that each attribute falls within the same range. Then we compute k orthogonal vectors; these orthogonal vectors are called the principal components. Each input data point can be expressed as a linear combination of the k principal component vectors. The principal components are sorted in decreasing order of significance. Since the components are sorted, the size of the data can be reduced by eliminating the weak components, that is, those components with low variance. Using the strongest principal components it is also possible to reconstruct a good approximation of the original data. This method is widely used for numerical data, and it can be applied when the number of dimensions is large. The slide shows how principal component analysis identifies the orthonormal vectors; two orthonormal vectors are shown there, which we consider as the principal components.
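A minimal sketch of the PCA steps just described (normalize, compute orthogonal components, sort by significance, keep the strongest), via an eigendecomposition of the covariance matrix; the two-attribute data set is made up.

```python
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X onto the k strongest principal components."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)              # step 1: normalize (zero-mean) each attribute
    cov = np.cov(X_centered, rowvar=False)       # covariance between attributes
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]        # step 3: sort components by decreasing variance
    components = eigenvectors[:, order[:k]]      # step 4: keep the k strongest, drop the weak ones
    return X_centered @ components, components

# Made-up two-attribute data lying roughly along a line
rng = np.random.default_rng(1)
t = rng.normal(size=100)
X = np.column_stack([t, 2 * t + rng.normal(0, 0.1, 100)])

reduced, components = pca_reduce(X, k=1)
print(reduced.shape)        # (100, 1): projected onto a single principal component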

The next approach is numerosity reduction. This technique reduces the data volume by choosing alternative, smaller forms of data representation. These methods are categorized into two groups:

• Parametric methods
• Non-parametric methods

In parametric methods, we assume that the data fits some model, estimate the model parameters, and finally store only the parameters and discard the data. The log-linear model is an example of a parametric method: it obtains the value at a point in n-dimensional space as a product over appropriate marginal subspaces. In non-parametric methods, we do not assume any model; instead we use histograms, clustering, and sampling to store a reduced representation of the data.

Regression and log linear models

In linear regression, the data are modelled by fitting a straight line. To fit the straight line we use the popular least squares method, which keeps the sum of the squares of the errors at a minimum. In the case of multiple regression, a response variable Y is modelled as a linear function of a multidimensional feature vector. In a parametric method we store only the parameters instead of the actual data. The slide shows the linear regression and log-linear models: Y = wX + b. Here we estimate the values of w and b, and then for a given value of X we store only the approximate value of Y. The two coefficients w and b specify the line and are estimated from the data at hand; that is, for the given set of points (x1, y1), ..., (xn, yn), we estimate the coefficients w and b in such a way that the sum of the squares of the errors is kept at a minimum. Using the least squares criterion, we minimize the error. In the case of multiple regression we have several variables, say X1 and X2, and the response is expressed in terms of X1 and X2; here we estimate the coefficients b0, b1 and b2 so that the sum of the squared errors is minimized. In log-linear models, used for probability calculations, the joint probability of a multi-way table is approximated by a product of lower-order tables.

Another data reduction method, of the non-parametric kind, is the histogram. Histograms divide the data into buckets and store the average (or sum) for each bucket. We can use either equal-width partitioning or equal-frequency partitioning: in equal-width partitioning each bucket covers an equal range of values, whereas in equal-frequency (equal-depth) partitioning the buckets are chosen, based on the occurrences of values, so that each contains roughly the same number of occurrences. With V-optimal histograms we choose the bucket boundaries with the least histogram variance, where the variance is a weighted sum over the original values that each bucket represents. Maximum difference (MaxDiff) is another criterion to take into account here: bucket boundaries are set between each pair of adjacent values having the largest differences. Basically, whenever we plot a histogram, we plot the frequency, i.e. the number of occurrences, of each interval.

Another data reduction method is clustering. In clustering, we partition the data into clusters based on a similarity measure and store only the cluster representation; the centroid and the diameter are basically the two parameters used to represent a cluster. This approach can be very effective if the data is naturally clustered, but not if the data is smeared. We can also build hierarchical clusterings and store the hierarchy of clusters in a multidimensional structure. Many clustering definitions and algorithms exist in the literature, but ultimately cluster analysis helps us identify an optimal number of clusters such that the elements within a cluster are highly similar and the elements across clusters are highly dissimilar.
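Going back to the linear model Y = wX + b described above, here is a minimal sketch of estimating w and b by least squares; the (x, y) points are made up.

```python
import numpy as np

# Made-up (x, y) points roughly following y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

# Closed-form least squares estimates for Y = wX + b
w = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b = y.mean() - w * x.mean()

print(round(w, 3), round(b, 3))        # stored in place of the raw data (parametric reduction)
print(np.polyfit(x, y, 1))             # cross-check with numpy's least squares fit
```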

Another kind of data reduction method is sampling. Sampling obtains a small sample s to represent the whole data set N. This method allows a mining algorithm to run with a complexity that is potentially sub-linear in the size of the data. Normally we choose a representative subset of the data; simple random sampling may have poor performance in the presence of skewed data, so we adapt the sampling approach. One such method is stratified sampling: in this method we approximate the percentage of each class, or sub-population of interest, in the overall database. This approach can be used effectively with skewed data. Note that sampling may not reduce database I/Os; it only identifies a subset of the data. Sampling can be done with replacement or without replacement; the slide shows the difference between a simple random sample without replacement and a simple random sample with replacement. Cluster or stratified sampling approaches can be used to determine the reduced representation of the data. In the example shown, the raw data is grouped into 3 clusters and some of the elements within each cluster are removed by stratified sampling; that is, we identify elements in such a way that the chosen elements of each cluster provide a reduced representation of all the elements within that cluster.
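A minimal sketch of simple random sampling (with and without replacement) and of stratified sampling that preserves each class's proportion; the class labels and sample sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up data set: 1000 tuples, 90% of class "A" and 10% of class "B" (skewed)
labels = np.array(["A"] * 900 + ["B"] * 100)
indices = np.arange(len(labels))

srs_without = rng.choice(indices, size=50, replace=False)   # simple random sample w/o replacement
srs_with    = rng.choice(indices, size=50, replace=True)    # simple random sample with replacement

def stratified_sample(indices, labels, size, rng):
    """Sample so that each class keeps roughly its proportion in the full data set."""
    chosen = []
    for cls in np.unique(labels):
        members = indices[labels == cls]
        k = max(1, round(size * len(members) / len(indices)))  # proportional allocation
        chosen.append(rng.choice(members, size=k, replace=False))
    return np.concatenate(chosen)

strat = stratified_sample(indices, labels, size=50, rng=rng)
print((labels[strat] == "B").mean())     # close to 0.10, the minority-class share
```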

Discretization and concept hierarchy generation

Here we look at how to generate a concept hierarchy for numerical data, and also how to generate a concept hierarchy for categorical data based on the distinct values of attributes in the database schema. Discretization and concept hierarchy generation are also part of pre-processing. The main advantage of using these techniques is that mining on a reduced representation of the data needs fewer input/output operations and is more efficient than mining on a large, non-generalised data set. Discretization applies to different kinds of attributes. An attribute may be nominal, where the values form an unordered set (for example colour or profession); ordinal, where the values form an ordered set (academic rank or organisational hierarchy belong to this category); or continuous, where the values are integers or real numbers. Each continuous attribute is divided into intervals over its range, and each interval is named with an interval label. In this way the many values that fall in a particular interval of a continuous attribute are replaced with the corresponding interval label. This simplifies and reduces the original data and also provides an easy-to-use, knowledge-level representation of the mining results. So discretization also belongs to the category of data reduction and prepares the data for further analysis.

Concept hierarchy generation is also a kind of discretization process.

If a discretization method does not use class information, it is termed an unsupervised discretization method. The discretization process can be applied either in a top-down or in a bottom-up manner. In top-down discretization, we use the split approach: the process starts by identifying one or a few points, called cut points or split points, to divide the entire attribute range, and this is repeated recursively on the resulting intervals.

In bottom-up discretization, the process starts by considering all the continuous values as potential split points; some split points are then removed by merging neighbouring values to form intervals, and this is repeated recursively on the resulting intervals.

A concept hierarchy for a given numerical attribute also defines a discretization for that attribute. It recursively reduces the data by collecting and replacing low-level concepts with higher-level concepts. For example, numerical age values may be replaced in each row by higher-level concepts such as middle-aged or senior. Another example of a concept hierarchy based on discretization: suppose you have a marks attribute whose values can be replaced by high-level concepts such as excellent, good, satisfactory, or fail; this minimizes the number of distinct values for the attribute marks. This type of generalization is more meaningful and easier to interpret, although we lose the original data. Another advantage of this approach is that it provides a consistent representation of the data across multiple data mining tasks. Concept hierarchies can be generated manually or automatically. Automatic generation of concept hierarchies for numerical and categorical attributes is less tedious and consumes less time for a domain expert or a user, whereas manual generation of concept hierarchies is more laborious and time consuming. Also, many concept hierarchies for categorical attributes are implicit within the database schema and can therefore be generated automatically at the schema definition level. In view of the above facts, concept hierarchies for numerical attributes can be generated automatically based on data discretization. Now, the question is: can we generate concept hierarchies manually? The answer is yes, it is possible, but it requires a lot of time and is a tedious process. Why is it difficult? The ranges of each attribute may be very wide, and the database is updated very frequently. For these reasons, manually defining concept hierarchies is tedious.

For numerical data, however, methods exist for the automatic generation of concept hierarchies, or for refining them dynamically. In fact, many hierarchies for categorical attributes are implicit within the database schema and can be automatically defined at the schema definition level.

Now we discuss the different kinds of discretization methods. Typical discretization methods are:

• Binning
• Histogram analysis
• Cluster analysis
• Entropy-based discretization
• Discretization by intuitive partitioning

Binning

Binning is a top-down splitting technique based on a specified number of bins. Since binning does not use any class information, it belongs to the unsupervised class of methods. These methods are sensitive to the number of bins specified by the user and also to the presence of outliers. In an earlier lecture we discussed various smoothing methods based on binning; those methods are also useful for generating concept hierarchies.

Histogram analysis

Histogram analysis is also unsupervised, since it does not use any class information. It partitions the values of an attribute into disjoint buckets and belongs to the top-down split, unsupervised class of methods. Basically, histograms are used to approximate the data distribution, and partitioning rules are used to generate them. In an equal-width histogram the values are partitioned into equal-sized intervals; for example, a marks attribute can be partitioned into four intervals, each with a width of 24. In an equal-frequency histogram the values are partitioned so that each partition contains the same number of data tuples. By applying the histogram analysis approach recursively to each partition, we can automatically generate a multi-level concept hierarchy. We can also use a minimum interval size parameter to control the recursive procedure: it specifies the minimum width of a partition, or the minimum number of values in each partition, at each level. The histogram can also be partitioned based on a cluster analysis of the data distribution. The slide shows a histogram for unit price partitioned into different intervals; a histogram consists of several rectangles that reflect the counts or frequencies of the values present in the given data. Other methods, such as cluster analysis and entropy-based discretization, are discussed next.
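A minimal sketch of equal-width and equal-frequency partitioning of a numeric attribute; the marks values are made up.

```python
import numpy as np

marks = np.array([5, 12, 18, 22, 35, 41, 44, 52, 58, 63, 71, 77, 85, 92, 96])

# Equal-width partitioning: 4 intervals of the same width over the attribute range
width_edges = np.linspace(marks.min(), marks.max(), num=5)
width_bins = np.digitize(marks, width_edges[1:-1])        # bin index 0..3 for each value

# Equal-frequency partitioning: 4 intervals holding roughly the same number of tuples
freq_edges = np.quantile(marks, [0.25, 0.5, 0.75])
freq_bins = np.digitize(marks, freq_edges)

print(width_edges)                                        # interval boundaries (labels)
print(np.bincount(width_bins))                            # counts per equal-width interval
print(np.bincount(freq_bins))                             # counts per equal-frequency interval
```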

Cluster analysis

Cluster analysis can use either a top-down split or a bottom-up merge approach, and it belongs to the unsupervised category. All of the methods so far, namely binning, clustering, and histogram analysis, can be applied recursively to generate a concept hierarchy. Each method in general assumes that the values to be discretized are sorted in ascending order. Other kinds of methods are:

• Entropy based discretization
• Interval merging by Chi-Square analysis
• Segmentation by natural partitioning

Entropy based discretization

Entropy is a commonly used discretization measure; it comes from information theory and is based on the concept of information gain. Entropy-based discretization is a supervised, top-down splitting method. It explores class distribution information when calculating and determining split points, and the data values are partitioned over the attribute range. Given a set of samples S, suppose S is partitioned into two intervals S1 and S2 using a boundary T; the boundary T can be considered a split point. The expected information requirement after partitioning is calculated with the formula:

I(S, T) = |S1|/|S| × Entropy(S1) + |S2|/|S| × Entropy(S2)

that is, the fraction of values of S that fall in S1 multiplied by the entropy of S1, plus the fraction of values of S that fall in S2 multiplied by the entropy of S2.

Entropy is calculated based on the class distribution of the samples in a set. Given m classes, the entropy of S1 is calculated using the formula shown on the slide,

Entropy(S1) = − Σ_{i=1..m} p_i log2(p_i)

where p_i is the probability of class i in S1. To discretize a numerical attribute A, this method selects the split point which minimizes the expected information (entropy) and recursively partitions the resulting intervals to arrive at a hierarchical discretization, which finally forms the concept hierarchy for attribute A.

The chosen boundaries, or split points, may reduce the data size and improve the classification accuracy. The recursive process stops when some stopping criterion is met.

Now we discuss the different steps. The class label attribute provides the class information for each tuple. The basic method for entropy-based discretization consists of three steps.

In step one, each value of A is considered as a potential interval boundary or split point to partition the range of A. That is, a split point for A can divide the tuples in D into two subsets, satisfying the conditions A ≤ split point and A > split point, respectively. This creates a binary discretization: all the rows where the value of A is less than or equal to the split point form one group, and all the rows where the value of A is greater than the split point form another group.

In step two, entropy-based discretization uses the information regarding the class labels of the tuples. Suppose we want to classify the tuples of D by partitioning on attribute A at some split point. Ideally, we would like the split point to separate the tuples of D into disjoint classes exactly; for example, all tuples of class C1 would fall into one partition and all tuples of class C2 into the other. This ideal is not always possible, because the first partition may contain some tuples of C1 and some of C2. The question then is how much more information is still needed for a perfect classification after this split; this is termed the expected information requirement. The expected information requirement for classifying a tuple of D based on attribute A and a split point is computed using the formula shown on the slide in step two:

Info_A(D) = |D1|/|D| × Entropy(D1) + |D2|/|D| × Entropy(D2)

Here D1 and D2 correspond to the tuples of D satisfying the conditions A ≤ split point and A > split point, and the number of tuples in D is denoted by the cardinality |D|. The entropy function is calculated based on the class distribution of the tuples in the set; for given m classes C1, C2, ..., Cm, the entropy of D1 is

Entropy(D1) = − Σ_{i=1..m} p_i log2(p_i)

where p_i is the probability of class Ci in D1, determined by dividing the number of tuples of class Ci in D1 by the total number of tuples in D1. The entropy of D2 is calculated similarly. Therefore, when selecting a split point for attribute A, we pick the attribute value that gives the minimum expected information requirement, i.e. the minimum amount of additional information needed to perfectly classify the tuples after partitioning by A ≤ split point and A > split point. This is equivalent to choosing the attribute-value pair with the maximum information gain. The chosen split point partitions the range of A into two intervals. In a nutshell, step two determines the expected information requirement for classifying the tuples and selects the split point that minimizes it.

Step three specifies the stopping condition: the recursion stops when the minimum information requirement over all candidate split points is less than some small threshold, or when the number of intervals is greater than a maximum threshold called max_interval. This approach reduces the data size and, since it uses class information, it belongs to the supervised approaches; the interval boundaries (split points) are placed where they can give better classification accuracy.
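A minimal sketch of one level of entropy-based binary discretization: it tries every candidate split point of a numeric attribute and keeps the one with the lowest expected information requirement. The attribute values and class labels are made up, and a full implementation would recurse on the two resulting intervals until a stopping condition such as max_interval is met.

```python
import numpy as np

def entropy(labels):
    """Entropy(D) = -sum(p_i * log2(p_i)) over the class distribution of D."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_split(values, labels):
    """Return the split point of a numeric attribute with minimum expected information."""
    values, labels = np.asarray(values), np.asarray(labels)
    best = (None, np.inf)
    for t in np.unique(values)[:-1]:                 # candidate boundaries A <= t / A > t
        left, right = labels[values <= t], labels[values > t]
        info = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if info < best[1]:
            best = (t, info)
    return best

# Made-up attribute values and class labels
age = [23, 25, 30, 35, 40, 45, 52, 60, 61, 70]
cls = ["N", "N", "N", "N", "Y", "Y", "Y", "Y", "Y", "Y"]

print(best_split(age, cls))   # split near 35: expected information 0 (perfect separation)
```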

The next method is interval merging by chi-square (χ²) analysis, known in the literature as ChiMerge. The methods discussed so far are based on a top-down splitting strategy, whereas this approach uses a bottom-up merging strategy. It is also a supervised method, since it uses class information.

Three conditions are used as the stopping criterion in this method. The first is based on the significance level: merging stops when the χ² values of all pairs of adjacent intervals exceed some threshold, which is determined by a specified significance level. A very high significance level may cause over-discretization, while a very low value may cause under-discretization; that is why the significance level is generally chosen between 0.10 and 0.01.

The second condition is based on the number of intervals: the number of intervals cannot exceed a pre-specified maximum, here called max_interval.

The third condition is that the relative class frequencies should be fairly consistent within an interval. Some inconsistency is allowed in practice, but only within a pre-specified threshold estimated from the training data (generally around 3 percent of the data). This also helps to remove irrelevant attributes from the data set.

The method starts as follows. Each distinct value of the numeric attribute is initially considered to be one interval. χ² tests are then performed for every pair of adjacent intervals, and the adjacent intervals with the least χ² value are merged together, since a low χ² value for a pair indicates similar class distributions. The merging process proceeds recursively until the predefined stopping criterion is met, that is, until the three conditions defined above are satisfied.
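A minimal sketch of the bottom-up merging idea: every distinct value starts as its own interval, and the adjacent pair with the smallest χ² value (computed from the class counts of the two intervals) is merged repeatedly. The values, labels, and the simple max_intervals stopping rule are illustrative simplifications of the three conditions above.

```python
import numpy as np

def chi2_pair(counts_a, counts_b):
    """Chi-square statistic for the 2 x m table formed by two adjacent intervals."""
    table = np.array([counts_a, counts_b], dtype=float)
    n = table.sum()
    expected = table.sum(axis=1, keepdims=True) * table.sum(axis=0, keepdims=True) / n
    expected[expected == 0] = 1e-9                  # avoid division by zero for empty classes
    return ((table - expected) ** 2 / expected).sum()

def chimerge(values, labels, max_intervals=3):
    values, labels = np.asarray(values), np.asarray(labels)
    classes = np.unique(labels)
    # Start with one interval per distinct value, keeping its class counts
    intervals = []
    for v in np.unique(values):
        mask = values == v
        intervals.append([v, v, [int((labels[mask] == c).sum()) for c in classes]])
    # Repeatedly merge the adjacent pair with the smallest chi-square value
    while len(intervals) > max_intervals:
        chis = [chi2_pair(intervals[i][2], intervals[i + 1][2])
                for i in range(len(intervals) - 1)]
        i = int(np.argmin(chis))
        lo, _, ca = intervals[i]
        _, hi, cb = intervals[i + 1]
        intervals[i:i + 2] = [[lo, hi, [a + b for a, b in zip(ca, cb)]]]
        # (a full ChiMerge would also stop once min(chis) exceeds the chi-square threshold
        #  implied by the chosen significance level)
    return [(lo, hi) for lo, hi, _ in intervals]

# Made-up numeric attribute with class labels
vals = [1, 3, 7, 8, 9, 11, 23, 37, 39, 45, 46, 59]
labs = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "A", "A", "A"]
print(chimerge(vals, labs))    # three intervals whose class distributions differ
```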

The methods discussed so far are useful for generating numerical hierarchies, but many users would like to see numerical ranges partitioned into relatively uniform, easy-to-read intervals that appear natural or intuitive. For example, annual salaries divided into ranges like Rs. 75,000 to 80,000 are often more desirable than ranges like 76,774 to 85,102 produced by some cluster analysis approach. The 3-4-5 rule partitions a given range of data into three, four, or five relatively equal-width intervals, recursively and level by level, based on the value range at the most significant digit. By applying the rule recursively to each interval, we ultimately obtain a concept hierarchy for the given numerical data. The rules, as shown on the slide, are: if an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 intervals (3 equal-width intervals for 3, 6, and 9 distinct values, and 3 intervals in the grouping 2-3-2 for 7 distinct values); if it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 equal-width intervals; and if it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 equal-width intervals. These rules are followed when generating the partitions.

Real-world data often contain extremely large positive and negative outlier values, which may distort any top-down discretization based on the minimum and maximum data values. For example, the annual income of a few people may be several orders of magnitude higher than that of the others in the same data set. Discretization based on such maximum values may lead to a biased hierarchy; hence discretization should be performed on a range of data values representing the majority of the data, and the extremely high or low values beyond the top-level discretization should be handled separately, but in a similar manner. For instance, we can consider the data between the 5th percentile and the 95th percentile; that is to say, we eliminate the unnecessarily wide interval partitions caused by extreme values.

Suppose a user desires the automatic generation of a concept hierarchy for an attribute profit. We use interval notation, where (L ... R] represents the range exclusive of L up to and including R; for example, (-10 lakh dollars ... 0 dollars] denotes the range from -10 lakh dollars (exclusive) to 0 dollars (inclusive). Instead of taking all data values, the extreme values on the left and on the right are dropped by taking the 5th percentile and the 95th percentile as the LOW and HIGH values for the top, or first, level of discretization: say a LOW of about -159 thousand dollars and a HIGH of about 1.8 million dollars (the numbers are rounded here). Rounding LOW down to the million-dollar digit we get LOW' = -1 million dollars, and rounding HIGH up to the million-dollar digit we get HIGH' = +2 million dollars. The most significant digit count is (2 million - (-1 million)) / 1 million = 3, i.e. the interval covers three distinct values at the most significant digit, so the segment is partitioned into three equal-width intervals according to the 3-4-5 rule: -1 million to 0 dollars as one interval, 0 to 1 million dollars as the second interval, and 1 million to 2 million dollars as the third interval. This represents the top tier of the hierarchy, where the number of intervals is decided by the 3-4-5 rule. By examining each interval recursively in the same way, each interval can be further partitioned according to the 3-4-5 rule for the next lower level of the hierarchy. This is an example of numeric concept hierarchy generation by intuitive partitioning, which is done based on these rules only.

Concept hierarchy generation for categorical data

Categorical, or discrete, data have a finite but possibly large number of distinct values with no ordering among the values. Geographic locations, item types, and job categories are all examples of categorical data. Several methods exist for concept hierarchy generation for such data; they are grouped into three kinds. In the first method, the user defines a partial or total ordering of a group of attributes at the schema level. For example, for a location dimension, the set of streets in a city, the set of cities in a state, and the set of states in a country form a concept hierarchy; a containment relation is used here to hierarchically organize the street values of various cities according to their states and country. The second method is the manual definition of a portion of a concept hierarchy; this mechanism is used to define explicit groupings for a small portion of intermediate-level data. For example, Andhra Pradesh, Karnataka and Tamil Nadu are parts of India (southern India); this is an explicit grouping of different states of the country India.
In the third method, a user may specify a set of attributes forming a concept hierarchy but omit to state their partial ordering explicitly. In such a case the system generates an attribute ordering in order to construct the concept hierarchy; for example, the user may only state that street precedes city and nothing else. In such situations we need a mechanism to automatically generate the concept hierarchy by analyzing the number of distinct values of each attribute. For example, for the set of attributes street, city, state and country, we can analyze how many distinct streets, cities, states and countries there are, and from this automatically generate a concept hierarchy. Suppose there are 16 distinct values for country, 365 distinct values for province or state, 3567 distinct values for city, and 674339 distinct values for street in the database. We can then generate the concept hierarchy automatically by arranging the attributes in increasing order of their number of distinct values and putting the attribute with the lowest number of distinct values (here, country) at the highest level, as the root of the concept hierarchy; the attribute with the most distinct values is placed at the lowest level of the hierarchy. Here street has the most distinct values, which is why it is kept at the lowest level. We repeat the process recursively to arrange the remaining attributes in the hierarchy; a small sketch of this distinct-value heuristic is given at the end of this section.

Of course, there can be exceptions, for example the time dimension with weekday, month, quarter, and year: there are 7 days in a week but 12 months, while the number of quarters is only 4, and year may take any number of values. This is an exception as far as automated generation of concept hierarchies is concerned. That is to say, some hierarchies can be automatically generated based on an analysis of the number of distinct values of the attributes in the data set: we simply compute the total number of distinct values for each attribute, place the attribute with the maximum number of distinct values at the lowest level, and place the attributes with fewer distinct values at higher levels.

In this lecture we discussed three different topics, namely data reduction, data transformation and normalization, and concept hierarchy generation for numerical as well as categorical data. All these methods are part of data pre-processing. When designing a data warehouse, we need to choose the appropriate method for a reduced representation of the data.
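As mentioned above, here is a minimal sketch of the distinct-value heuristic for ordering categorical attributes into a concept hierarchy; the small location table is made up.

```python
import pandas as pd

# Made-up location table (in practice this would come from the database)
locations = pd.DataFrame({
    "country": ["India", "India", "India", "India", "USA"],
    "state":   ["Karnataka", "Karnataka", "Karnataka", "Tamil Nadu", "California"],
    "city":    ["Bengaluru", "Bengaluru", "Mysuru", "Chennai", "San Jose"],
    "street":  ["MG Road", "Brigade Road", "Sayyaji Rao Road", "Anna Salai", "First Street"],
})

# Count distinct values per attribute; fewest distinct values -> top of the hierarchy
distinct_counts = locations.nunique().sort_values()
hierarchy = list(distinct_counts.index)

print(distinct_counts.to_dict())
print(" > ".join(hierarchy))     # country > state > city > street (top to bottom)
```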

Thank You.