ROUGH SET THEORY AND FUZZY LOGIC BASED WAREHOUSING OF HETEROGENEOUS CLINICAL DATABASES
Yiwen Fan
Purpose: Warehouse medical databases
Clinical databases have accumulated large quantities of information about patients and their medical conditions.
To warehouse these databases and analyze patients’ conditions, we need an efficient data mining technique.
Data mining process: data warehousing, data query and cleaning, and data analysis.
Three major data mining techniques:
Regression
Clustering
Classification
Techniques used in this paper
Two phases: clustering and classification
First phase: use Rough Set Theory (RST) for clustering (the clustering technique reduces the complexity of the RST result)
Second phase: use fuzzy logic to classify the resulting clusters
Rough Set Theory (RST): clustering
Fuzzy logic: classification
Definition of clustering: a data mining technique for warehousing heterogeneous databases. It groups data with similar characteristics into the same cluster and places data with dissimilar characteristics in different clusters. (It is used to handle uncertainty and incomplete information.)
Previous clustering techniques:
K-Means
Expectation Maximization
Association Rule
K-Prototype
Fuzzy K-Modes, etc.
Phase 1 – Clustering
Definition: Partition data into groups of similar categories or objects.
Cluster: A group of similar categories or objects.
Different clusters:
The categories within a cluster are similar to each other and dissimilar to the categories of other clusters.
Fewer clusters:
1) Cost: data details are lost;
2) Benefit: simplification.
The search for clusters is unsupervised learning.
Cluster types:
Exclusive clusters: Each category or object belongs to only one cluster.
Overlapping clusters: A category or object may belong to several clusters.
Probabilistic clusters: A category or object belongs to each cluster with a certain probability.
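The three cluster types differ mainly in how membership is stored. The sketch below is only an illustration: the object names, cluster ids and probabilities are made up, and `to_exclusive` is a hypothetical helper, not something from the paper.

```python
# Hypothetical toy data: three objects and two clusters (ids 0 and 1).

# Exclusive: each object maps to exactly one cluster id.
exclusive = {"p1": 0, "p2": 0, "p3": 1}

# Overlapping: each object maps to a set of cluster ids.
overlapping = {"p1": {0}, "p2": {0, 1}, "p3": {1}}

# Probabilistic: each object maps to one probability per cluster (summing to 1).
probabilistic = {"p1": [0.9, 0.1], "p2": [0.5, 0.5], "p3": [0.2, 0.8]}


def to_exclusive(prob):
    """Collapse probabilistic memberships by keeping the most likely cluster.

    Ties go to the lowest cluster id (an arbitrary choice for this sketch).
    """
    return {obj: max(range(len(p)), key=p.__getitem__) for obj, p in prob.items()}
```

Collapsing the probabilistic form discards the uncertainty that the probabilities carried, which is the same trade-off between detail and simplification noted for using fewer clusters.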
Notations in Rough Set Theory (RST)
Definition 1:- Indiscernibility Relation: IND(B)
Definition 2:- Equivalence Class: [x_i]_IND(B)
Definition 3:- Lower Approximation:
Definition 4:- Upper Approximation:
Definition 5:- Roughness:
Definition 6:- Mean Roughness
Definition 7:- Standard Deviation
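The formulas for these definitions did not survive on the slides, so the sketch below falls back on the standard textbook forms of the indiscernibility partition and the lower and upper approximations. The table, attribute names and function names are illustrative assumptions. The last function returns the ratio |lower| / |upper|; some rough-set clustering papers call this ratio the roughness, while Pawlak's original roughness is one minus it, so the function is named neutrally.

```python
from collections import defaultdict


def equivalence_classes(table, attrs):
    """Partition object ids by their values on attrs (the classes of IND(B))."""
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[a] for a in attrs)].add(obj)
    return list(classes.values())


def lower_approximation(table, attrs, target):
    """Union of the equivalence classes that lie entirely inside target."""
    return set().union(*[c for c in equivalence_classes(table, attrs) if c <= target])


def upper_approximation(table, attrs, target):
    """Union of the equivalence classes that share at least one object with target."""
    return set().union(*[c for c in equivalence_classes(table, attrs) if c & target])


def approximation_ratio(table, attrs, target):
    """|lower| / |upper|; defined as 1.0 for an empty upper approximation (assumption)."""
    upper = upper_approximation(table, attrs, target)
    if not upper:
        return 1.0
    return len(lower_approximation(table, attrs, target)) / len(upper)


# Hypothetical four-patient table with two categorical attributes.
table = {
    "p1": {"chest_pain": "typical", "fbs": "high"},
    "p2": {"chest_pain": "typical", "fbs": "high"},
    "p3": {"chest_pain": "atypical", "fbs": "low"},
    "p4": {"chest_pain": "atypical", "fbs": "high"},
}
target = {"p1", "p3"}
attrs = ["chest_pain", "fbs"]
```

Here p1 and p2 are indiscernible, so p1 cannot be placed in the lower approximation of the target even though it belongs to the target; only p3 can.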
1) Whole data set -> parent node U.
2) Current number of data sets -> CNC (iterated from 1 to K).
3) For the A attributes, find the attributes that fall in the same category.
4) Calculate the roughness of these attributes for this category.
5) Find the mean value of all these attributes.
6) Calculate and store the standard deviation of these attributes.
7) The smallest standard deviation is used for the next iteration.
8) If the standard deviation does not match the smallest value, the next smallest value is taken as the splitting attribute.
9) Perform binary splitting: split the whole data set into two clusters.
10) Use the Distance of Relevance formula to select the cluster (the one with the largest distance).
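The attribute-selection part of the steps above (roughness, mean, standard deviation, smallest value wins) can be sketched as follows. This is one possible reading, not the paper's exact procedure: the roughness is assumed to be |lower| / |upper| of each value class of one attribute measured against another attribute, the population standard deviation is assumed, and all function names and the toy table are hypothetical. The binary split and the Distance of Relevance step are not covered.

```python
from collections import defaultdict
from statistics import mean, pstdev


def partition(table, attr):
    """Group object ids by their value on a single attribute."""
    groups = defaultdict(set)
    for obj, row in table.items():
        groups[row[attr]].add(obj)
    return list(groups.values())


def ratio(target, blocks):
    """|lower| / |upper| of a non-empty target with respect to a partition."""
    lower = sum(len(b) for b in blocks if b <= target)
    upper = sum(len(b) for b in blocks if b & target)
    return lower / upper


def mean_roughness(table, a_i, a_j):
    """Average the ratio over every value class of a_i, measured against a_j."""
    blocks = partition(table, a_j)
    return mean(ratio(x, blocks) for x in partition(table, a_i))


def choose_splitting_attribute(table, attrs):
    """Return the attribute whose mean-roughness values have the smallest
    standard deviation, together with all scores (assumed selection rule)."""
    scores = {}
    for a_i in attrs:
        values = [mean_roughness(table, a_i, a_j) for a_j in attrs if a_j != a_i]
        scores[a_i] = pstdev(values)
    return min(scores, key=scores.get), scores


# Hypothetical categorical table with three attributes.
table = {
    "p1": {"a": "x", "b": "u", "c": "m"},
    "p2": {"a": "x", "b": "u", "c": "n"},
    "p3": {"a": "y", "b": "v", "c": "m"},
    "p4": {"a": "y", "b": "u", "c": "n"},
}
best, scores = choose_splitting_attribute(table, ["a", "b", "c"])
```

With this toy table, attribute `b` has the same mean roughness against both other attributes, so its standard deviation is zero and it is selected.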
Phase 2 – Classification
Fuzzy inference: generating a mapping from a given input to an output using fuzzy logic. The mapping then provides a basis from which decisions can be made or patterns discerned.
Fuzzy Inference System (FIS):
1) Fuzzification
2) Fuzzy Rules Generation
3) Defuzzification
Fuzzy inference process:
1) Membership functions
2) Logical operations
3) If-Then rules
Fuzzification
Conditions:
1. All the “Cluster 1 (C-1)” values are compared with the “Minimum Limit Value (ML(C-1))”. Any Cluster 1 values less than ML(C-1) are set as L.
2. All the “Cluster 1 (C-1)” values are compared with the “Maximum Limit Value (XL(C-1))”. Any Cluster 1 values greater than XL(C-1) are set as H.
3. Any “Cluster 1 (C-1)” values greater than ML(C-1) and less than XL(C-1) are set as M.
Similarly, apply the same conditions to the other cluster, C-2, to generate its fuzzy values.
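The three conditions amount to a threshold mapping, sketched below. How the minimum and maximum limits are chosen is not stated on the slides, so they are plain parameters here; treating a value exactly equal to a limit as M is also an assumption, and the function name is hypothetical.

```python
def fuzzify(values, min_limit, max_limit):
    """Map each cluster value to "L", "M" or "H" using the two limit values.

    Values equal to a limit are labelled "M" (an assumption; the slides do
    not cover that case).
    """
    labels = []
    for v in values:
        if v < min_limit:
            labels.append("L")
        elif v > max_limit:
            labels.append("H")
        else:
            labels.append("M")
    return labels
```

For example, `fuzzify([90, 120, 160], 100, 140)` labels the three values L, M and H respectively. The same function would be called once per cluster with that cluster's own limits.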
Fuzzy Rules Generation
General form of a fuzzy rule: “IF A THEN B”
IF: antecedent
THEN: conclusion
The output values of the FIS, between L and H, are used in training to generate the fuzzy rules.
The fuzzy rules are generated from the fuzzy values produced for each feature in the fuzzification process.
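An “IF A THEN B” rule over fuzzified features can be represented as an antecedent paired with a conclusion. The sketch below is only an illustration of that form: the two rules, the default label and the function name are hypothetical, and the slides do not say how the actual rules were derived.

```python
# Each rule is (antecedent, conclusion); the antecedent maps a feature name
# to the fuzzy label it must have. These two rules are made-up examples.
rules = [
    ({"age": "H", "cholesterol": "H"}, "H"),
    ({"age": "L", "cholesterol": "L"}, "L"),
]


def apply_rules(fuzzy_values, rules, default="M"):
    """Return the conclusion of the first rule whose antecedent fully matches.

    Falls back to `default` when no rule fires (an assumption of this sketch).
    """
    for antecedent, conclusion in rules:
        if all(fuzzy_values.get(f) == label for f, label in antecedent.items()):
            return conclusion
    return default
```

A record fuzzified to `{"age": "H", "cholesterol": "H"}` fires the first rule, while a mixed record such as `{"age": "L", "cholesterol": "H"}` matches neither and gets the default.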
Defuzzification
Input: the fuzzy set
Output: a single value, L, M or H, indicating whether the given input dataset is in the Low, Medium or High range.
The FIS is trained using the fuzzy rules, and testing is done with the datasets.
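The slides do not name the defuzzification method. One common choice when the output is a label rather than a number is to pick the label with the highest membership degree; the sketch below assumes that method, and the function name and input format are hypothetical.

```python
def defuzzify(memberships):
    """Return the label ("L", "M" or "H") with the highest membership degree.

    Missing labels count as 0.0; ties go to the earlier label in L, M, H
    order (both are assumptions of this sketch).
    """
    return max(("L", "M", "H"), key=lambda label: memberships.get(label, 0.0))
```

For example, a fuzzy set with degrees 0.1 for L, 0.3 for M and 0.6 for H is reduced to the single value H.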
Evaluation metrics
Sensitivity: Measures the proportion of actual positives that are correctly identified. It relates to the test’s ability to identify positive results.
Specificity: Measures the proportion of actual negatives that are correctly identified. It relates to the test’s ability to identify negative results.
Accuracy: From the above results, the accuracy value can be obtained using the following formula: Accuracy = (TP + TN) / (TP + TN + FP + FN), where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives.
Evaluate the effectiveness of the proposed systems
Justify theoretical and practical developments of these systems
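The three metrics follow directly from the confusion-matrix counts. The sketch below uses the standard definitions; the function name and the 0/1 label encoding (1 for disease present) are assumptions, and it expects both classes to occur in the actual labels.

```python
def evaluate(actual, predicted):
    """Compute sensitivity, specificity and accuracy from 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Hypothetical labels for eight patients.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = evaluate(actual, predicted)
```

In this toy example there are 3 true positives, 3 true negatives, 1 false positive and 1 false negative, so all three metrics come out at 0.75.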
Results and Discussions
The paper used the heart disease datasets: Cleveland, Hungarian and Switzerland.
Total number of attributes: 76
Generally used 14 attributes: age, sex, chest pain type, resting blood pressure, serum cholesterol in mg/dl, fasting blood sugar, resting electrocardiographic results, maximum heart rate achieved, exercise-induced angina, ST depression, slope of the peak exercise ST segment, number of major vessels, thal, and diagnosis of heart disease.
Clustering Results
The dataset is clustered into two sets.
Red dots -> Cluster 1; blue dots -> Cluster 2; crosses -> centroids
Cleveland dataset
Graph for the sensitivity, specificity and accuracy of the Cleveland dataset
Performance evaluation for sensitivity, specificity and accuracy of the Cleveland dataset
Iteration No   Sensitivity (in %)   Specificity (in %)   Accuracy (in %)
1              21                   7                    30
2              29                   19                   37
3              36                   25                   44
4              54                   25                   45
5              57                   38                   47
6              57                   38                   50
7              64                   50                   54
8              71                   57                   59
9              71                   69                   64
10             79                   75                   75
Switzerland dataset
Graph for the sensitivity, specificity and accuracy of the Switzerland dataset
Performance evaluation for sensitivity, specificity and accuracy of the Switzerland dataset
Iteration No   Sensitivity (in %)   Specificity (in %)   Accuracy (in %)
1              8                    98                   15
2              25                   98                   31
3              68                   98                   69
4              83                   98                   85
5              83                   98                   85
6              93                   98                   92
7              93                   98                   92
8              98                   98                   98
Hungarian dataset
Graph for the sensitivity, specificity and accuracy of the Hungarian dataset
Performance evaluation for sensitivity, specificity and accuracy of the Hungarian dataset
Iteration No   Sensitivity (in %)   Specificity (in %)   Accuracy (in %)
1              9                    26                   40
2              9                    58                   50
3              18                   59                   54
4              28                   63                   54
5              37                   69                   57
6              37                   69                   57
7              37                   73                   60
8              46                   79                   62
9              46                   89                   69
10             64                   98                   72
Conclusion
The Switzerland dataset provided better results than the other two datasets.
At the highest iteration level, good clustering and classification results were achieved.
Rough Set Theory was used as the clustering algorithm.
Fuzzy logic was used to classify the clusters.
The experimentation was carried out on heart disease datasets.
The evaluation metrics of sensitivity, specificity and accuracy for the proposed work were also analyzed.
References:
[1] R. Saravana Kumar, “Rough Set Theory and Fuzzy Logic Based Warehousing of Heterogeneous Clinical Databases”,
[2] Duo Chen, Du-Wu Cui, Chao-Xue Wang, and Zhu-Rong Wang, “A Rough Set-Based Hierarchical Clustering Algorithm for Categorical Data”, International Journal of Information Technology, Vol. 12, No. 3, pp. 149-159, 2006.