ROUGH SET THEORY AND FUZZY LOGIC BASED WAREHOUSING OF HETEROGENEOUS CLINICAL DATABASES

Yiwen Fan

Purpose: Warehouse medical databases.

Clinical databases have accumulated large quantities of information about patients and their medical conditions.

To warehouse these databases and analyze patients' conditions, we need an efficient data mining technique.

Data Mining Process: Data warehousing, data query and cleaning, and data analysis.

Three major data mining techniques:

Regression

Clustering

Classification

Techniques used in this paper

Two phases: Clustering and Classification

First phase: Use Rough Set Theory (RST) for clustering (the clustering technique reduces the complexity of the RST result).

Second phase: Use fuzzy logic to classify the resulting clusters.

Rough Set Theory (RST): Clustering

Fuzzy logic: Classification

Definition of Clustering: A data mining technique for warehousing heterogeneous databases. It groups data with similar characteristics into the same cluster and places data with dissimilar characteristics in other clusters. (Used to handle uncertainty and incomplete information.)

Previous clustering techniques:

K-Means

Expectation Maximization

Association Rule

K-Prototype

Fuzzy K-Modes, etc.

Phase 1 – Clustering

Definition: Partition data into groups of similar categories or objects.

Cluster: A group of objects that fall in the same category.

Different Clusters:

The objects within a cluster are similar to one another and dissimilar to the objects in other clusters.

Fewer Clusters:

1) Loss: Data details are lost;

2) Benefit: Simplification.

The search for clusters is a form of unsupervised learning.

Cluster Types:

Exclusive Clusters: Each category or object belongs to only one cluster.

Overlapping Clusters: A category or an object may belong to many clusters.

Probabilistic Clusters: A category or an object belongs to each cluster with a certain probability.

Notations in Rough Set Theory (RST)

Definition 1:- Indiscernibility Relation: IND(B)

Definition 2:- Equivalence Class: [x_i]_IND(B)

Definition 3:- Lower Approximation:

Definition 4:- Upper Approximation:

Definition 5:- Roughness:

Definition 6:- Mean Roughness

Definition 7:- Standard Deviation
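The definitions above are listed without their formulas. As a rough illustration, the sketch below uses the standard Pawlak forms: IND(B) groups objects that agree on every attribute in B, the lower approximation is the union of equivalence classes contained in a target set X, the upper approximation is the union of classes that intersect X, and roughness is 1 - |lower| / |upper|. The function names and the toy patient table are hypothetical, and the paper's exact formulation (including its mean roughness and standard deviation) may differ.

```python
from collections import defaultdict


def equivalence_classes(table, attrs):
    """Partition object ids into the classes of IND(attrs)."""
    classes = defaultdict(set)
    for obj_id, row in table.items():
        key = tuple(row[a] for a in attrs)
        classes[key].add(obj_id)
    return list(classes.values())


def lower_approximation(table, attrs, target):
    """Union of the equivalence classes fully contained in target."""
    result = set()
    for cls in equivalence_classes(table, attrs):
        if cls <= target:
            result |= cls
    return result


def upper_approximation(table, attrs, target):
    """Union of the equivalence classes that intersect target."""
    result = set()
    for cls in equivalence_classes(table, attrs):
        if cls & target:
            result |= cls
    return result


def roughness(table, attrs, target):
    """1 - |lower| / |upper|; 0.0 means target is exactly definable."""
    upper = upper_approximation(table, attrs, target)
    if not upper:
        return 0.0
    lower = lower_approximation(table, attrs, target)
    return 1.0 - len(lower) / len(upper)


# Hypothetical toy data: object id -> attribute values.
patients = {
    1: {"chest_pain": "typical", "fbs": "high"},
    2: {"chest_pain": "typical", "fbs": "high"},
    3: {"chest_pain": "atypical", "fbs": "low"},
    4: {"chest_pain": "atypical", "fbs": "high"},
}
target = {1, 2, 3}  # an arbitrary target set X

print(roughness(patients, ["chest_pain"], target))         # 0.5
print(roughness(patients, ["chest_pain", "fbs"], target))  # 0.0
```

With only chest_pain, objects 3 and 4 are indiscernible, so X can only be approximated; adding fbs separates them and X becomes exactly definable.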

1) Take the whole data set as the parent node U.

2) Set the current number of clusters, CNC (iterated from 1 to K).

3) For each of the attributes A, find the values that fall in the same category.

4) Calculate the roughness of these attributes for this category.

5) Find the mean value of all these attributes.

6) Calculate and store the standard deviation of these attributes.

7) The smaller standard deviation is used for the next iteration.

8) If the standard deviation does not match the smaller value, the next smaller value is taken as the splitting attribute.

9) Perform binary splitting: split the whole data set into two clusters.

10) Use the Distance of Relevance formula to select the cluster (the one with the largest distance).
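The roughness, mean and standard-deviation steps above can be sketched as follows. This is a simplified reading, not the paper's implementation: it assumes roughness is 1 - |lower| / |upper|, computes the mean roughness of each attribute relative to every other attribute, and picks the attribute whose mean-roughness values have the smallest population standard deviation. The binary split and the Distance of Relevance step are not covered, and all names and the toy rows are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def partition(rows, attr):
    """Group row indices by their value on attr."""
    groups = defaultdict(set)
    for i, row in enumerate(rows):
        groups[row[attr]].add(i)
    return groups


def roughness(rows, target, attr):
    """Roughness of a non-empty target set with respect to one attribute."""
    lower, upper = 0, 0
    for cls in partition(rows, attr).values():
        if cls <= target:
            lower += len(cls)
        if cls & target:
            upper += len(cls)
    return 1.0 - lower / upper


def mean_roughness(rows, a_i, a_j):
    """Mean roughness of the value classes of a_i with respect to a_j."""
    classes = partition(rows, a_i).values()
    return mean(roughness(rows, cls, a_j) for cls in classes)


def choose_splitting_attribute(rows, attrs):
    """Pick the attribute with the smallest std dev of mean roughness."""
    spread = {}
    for a_i in attrs:
        values = [mean_roughness(rows, a_i, a_j) for a_j in attrs if a_j != a_i]
        spread[a_i] = pstdev(values)
    return min(spread, key=spread.get), spread


# Hypothetical categorical rows; "a" and "b" induce the same partition.
rows = [
    {"a": "x", "b": "p", "c": "m"},
    {"a": "x", "b": "p", "c": "n"},
    {"a": "y", "b": "q", "c": "m"},
    {"a": "y", "b": "q", "c": "n"},
]
chosen, spread = choose_splitting_attribute(rows, ["a", "b", "c"])
print(chosen, spread)
```

At least three attributes are needed for the standard deviation to be informative, since each attribute is compared against all the others.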

Phase 2 – Classification

Fuzzy Inference: Generating a mapping from a given input to an output using fuzzy logic. The mapping then provides a basis from which decisions can be made or patterns discerned.

Fuzzy Inference System:

1) Fuzzification

2) Fuzzy Rules Generation

3) Defuzzification

Fuzzy Inference Process:

1) Membership Functions

2) Logical Operations

3) If-Then Rules

Fuzzification

Conditions:

1. All the "Cluster 1 (C-1)" values are compared with the "Minimum Limit Value (ML(C-1))". Any Cluster 1 values less than ML(C-1) are set as L.

2. All the "Cluster 1 (C-1)" values are compared with the "Maximum Limit Value (XL(C-1))". Any Cluster 1 values greater than XL(C-1) are set as H.

3. Any Cluster 1 values greater than ML(C-1) and less than XL(C-1) are set as M.

Similarly, define the conditions for the other cluster, C-2, to generate its fuzzy values.
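A minimal sketch of these three conditions, reading condition 2 as "greater than the maximum limit" so that it is consistent with condition 3. The function names and the sample limits are hypothetical, and how ML and XL are obtained for each cluster is not specified here. Values exactly equal to a limit are not covered by the conditions; assigning them to M is an assumption of this sketch.

```python
def fuzzify(value, min_limit, max_limit):
    """Map a numeric value to the fuzzy label L, M or H."""
    if value < min_limit:
        return "L"  # below the minimum limit ML
    if value > max_limit:
        return "H"  # above the maximum limit XL
    return "M"      # between the limits (boundary values included by assumption)


def fuzzify_cluster(values, min_limit, max_limit):
    """Apply fuzzify to every value of one cluster."""
    return [fuzzify(v, min_limit, max_limit) for v in values]


# Hypothetical cluster values with ML = 100 and XL = 140.
print(fuzzify_cluster([90, 120, 160], 100, 140))  # ['L', 'M', 'H']
```

The same function would be called with the limits of C-2 to fuzzify the second cluster.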

Fuzzy Rules Generation

General form of Fuzzy Rule: “IF A THEN B”

IF: antecedent

THEN: conclusion

The FIS is trained on the output values between L and H to generate the fuzzy rules.

The fuzzy rules are generated from the fuzzy values produced for each feature in the Fuzzification process.
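One simple way to picture rule generation from fuzzified features is shown below. It is only an illustrative assumption, not the paper's procedure: each distinct combination of fuzzy labels becomes an antecedent, and the most frequent output label seen with it becomes the conclusion. The function names, the default label and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def generate_rules(fuzzy_rows, outputs):
    """Build IF-THEN rules: antecedent tuple -> most frequent conclusion."""
    votes = defaultdict(Counter)
    for antecedent, conclusion in zip(fuzzy_rows, outputs):
        votes[tuple(antecedent)][conclusion] += 1
    return {ant: counts.most_common(1)[0][0] for ant, counts in votes.items()}


def apply_rules(rules, antecedent, default="M"):
    """Return the conclusion of the matching rule, or a default label."""
    return rules.get(tuple(antecedent), default)


# Hypothetical fuzzified training rows (two features) and their outputs.
fuzzy_rows = [("L", "H"), ("L", "H"), ("L", "H"), ("M", "M")]
outputs = ["H", "H", "L", "M"]

rules = generate_rules(fuzzy_rows, outputs)
for antecedent, conclusion in rules.items():
    print("IF", antecedent, "THEN", conclusion)
```

An input whose label combination was never seen in training falls back to the default label, which is a design choice of this sketch.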

Defuzzification

Input: The fuzzy set

Output: A single value, L, M or H, representing whether the given input dataset is in the Low, Medium or High range.

The FIS is trained with the fuzzy rules, and testing is carried out on the datasets.

Evaluation metrics

Sensitivity: Measures the proportion of actual positives that are correctly identified. It relates to the test's ability to identify positive results.

Specificity: Measures the proportion of negatives that are correctly identified. It relates to the ability of the test to identify negative results.

Accuracy: Obtained from the above results as the proportion of all cases, positive and negative, that are correctly identified.

Evaluate the effectiveness of the proposed systems

Justify theoretical and practical developments of these systems
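The three metrics can be computed from confusion-matrix counts using their standard definitions: sensitivity = TP / (TP + FN), specificity = TN / (TN + FP), and accuracy = (TP + TN) / (TP + TN + FP + FN). The sketch below assumes these standard forms; the function name and the sample counts are hypothetical and are not taken from the reported results.

```python
def evaluation_metrics(tp, tn, fp, fn):
    """Return (sensitivity, specificity, accuracy) as fractions in [0, 1].

    Assumes at least one actual positive and one actual negative,
    so that no denominator is zero.
    """
    sensitivity = tp / (tp + fn)  # correctly identified positives
    specificity = tn / (tn + fp)  # correctly identified negatives
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy


# Hypothetical counts: 10 actual positives and 10 actual negatives.
sens, spec, acc = evaluation_metrics(tp=8, tn=9, fp=1, fn=2)
print(f"{sens:.0%} {spec:.0%} {acc:.0%}")  # 80% 90% 85%
```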

Results and Discussions

The paper used three heart disease datasets: Cleveland, Hungarian and Switzerland.

Total Number of Attributes: 76

Generally used 14 attributes: age, sex, chest pain type, resting blood pressure, serum cholesterol in mg/dl, fasting blood sugar, resting electrocardiographic results, maximum heart rate achieved, exercise-induced angina, ST depression, slope of the peak exercise ST segment, number of major vessels, thal, and diagnosis of heart disease.

Clustering Results

The dataset is clustered into two sets.

Plot legend: red dots = Cluster 1, blue dots = Cluster 2, crosses = centroids.

Cleveland dataset

Performance evaluation for sensitivity, specificity and accuracy of the Cleveland dataset

Iteration No.   Sensitivity (in %)   Specificity (in %)   Accuracy (in %)
1               21                   7                    30
2               29                   19                   37
3               36                   25                   44
4               54                   25                   45
5               57                   38                   47
6               57                   38                   50
7               64                   50                   54
8               71                   57                   59
9               71                   69                   64
10              79                   75                   75

Graph of the sensitivity, specificity and accuracy for the Cleveland dataset

Switzerland dataset

Performance evaluation for sensitivity, specificity and accuracy of the Switzerland dataset

Iteration No.   Sensitivity (in %)   Specificity (in %)   Accuracy (in %)
1               8                    98                   15
2               25                   98                   31
3               68                   98                   69
4               83                   98                   85
5               83                   98                   85
6               93                   98                   92
7               93                   98                   92
8               98                   98                   98

Graph of the sensitivity, specificity and accuracy for the Switzerland dataset

Hungarian dataset

Performance evaluation for sensitivity, specificity and accuracy of the Hungarian dataset

Iteration No.   Sensitivity (in %)   Specificity (in %)   Accuracy (in %)
1               9                    26                   40
2               9                    58                   50
3               18                   59                   54
4               28                   63                   54
5               37                   69                   57
6               37                   69                   57
7               37                   73                   60
8               46                   79                   62
9               46                   89                   69
10              64                   98                   72

Graph of the sensitivity, specificity and accuracy for the Hungarian dataset

Conclusion

The Switzerland dataset gave better results than the other two datasets.

At the highest iteration level, good clustering and classification results were achieved.

Rough Set Theory was used as the clustering algorithm.

Fuzzy logic was used to classify the clusters.

The experiments were carried out on heart disease datasets.

The evaluation metrics of sensitivity, specificity and accuracy for the proposed work were also analyzed.

Result:

Reference:

[1] R. Saravana Kumar, "ROUGH SET THEORY AND FUZZY LOGIC BASED WAREHOUSING OF HETEROGENEOUS CLINICAL DATABASES"

[2] Duo Chen, Du-Wu Cui, Chao-Xue Wang, and Zhu-Rong Wang, "A Rough Set-Based Hierarchical Clustering Algorithm for Categorical Data", International Journal of Information Technology, Vol.12, No.3, pp. 149-159, 2006
