Alzheimer's Disease: Clinical Data Classification
By George Kalangi
Venkata Gopi
Overview:
• Introduction
• Analysis of commonly used terms and explanation of the data sets
• Overall programming process
• Generating a merged file with CDGLOBAL
• Generation of files for future status prediction
• Data preprocessing
• Classification algorithms used on the data
• Analysis of the output data from WEKA
Introduction
• What is Alzheimer's Disease?
• A brain disorder
• The most common form of dementia
  – Dementia is a term for the loss of memory and other intellectual abilities serious enough to interfere with daily life.
• Clinical Dementia Rating (CDR): 0, 0.5, 1, 2, 3
  – 0 = Normal
  – 0.5 = Questionable dementia
  – 1.0 to 3.0 = Mild to severe dementia
Datasets (60 Files)
" 56 comma separated files 1 File – Data Dic4onary (Explains the terms used)
1 File – Clinical Demen4a Ra4ng (Has CDGLOBAL)
Rest Assessments Data Defini4ons
Other like visits having abbrevia4ons
Environment Setup
• Programming languages used for the project: PHP, MySQL, Java, PostgreSQL
• Tools used: WEKA (Waikato Environment for Knowledge Analysis), MySQL Workbench, and NetBeans
• Front end: PHP
• Back end: MySQL
Overall Programming Process
• A selected dataset (e.g., FAQ) is given by the user.
• At the back end, MySQL queries create the required tables and insert the required data into the corresponding tables.
• Thereafter, the required operations are performed on the tables.
• Final output files are stored in .csv format (a minimal sketch of this export step follows below).
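The export step itself is not shown on the slides; the following is a minimal, hypothetical Java/JDBC sketch of writing a query result to a .csv file. The database name "adni", the table name "merged_faq_cdr", and the credentials are placeholders, not the project's actual names.

    // Hypothetical sketch: export a MySQL table to CSV over JDBC.
    // Assumes the MySQL Connector/J driver is on the classpath and that
    // "adni" / "merged_faq_cdr" stand in for the real database and table names.
    import java.io.FileWriter;
    import java.sql.*;

    public class ExportToCsv {
        public static void main(String[] args) throws Exception {
            Connection con = DriverManager.getConnection(
                    "jdbc:mysql://localhost:3306/adni", "user", "password");
            Statement st = con.createStatement();
            ResultSet rs = st.executeQuery("SELECT * FROM merged_faq_cdr");
            ResultSetMetaData md = rs.getMetaData();
            int cols = md.getColumnCount();

            FileWriter out = new FileWriter("Merged_dataset_file.csv");
            // Header row with the column names
            for (int i = 1; i <= cols; i++) {
                out.write(md.getColumnName(i) + (i < cols ? "," : "\n"));
            }
            // One CSV line per result row
            while (rs.next()) {
                for (int i = 1; i <= cols; i++) {
                    out.write(rs.getString(i) + (i < cols ? "," : "\n"));
                }
            }
            out.close();
            con.close();
        }
    }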
Generating a merged file with CDGLOBAL (for current status)
• For the given input datasets (e.g., adni_faq_2011-01-20.csv and adni_cdr_2011-01-20.csv), the RIDs and VISCODEs of the FAQ and CDR files are compared, and based on the matches the CDGLOBAL column of the CDR file is merged into the FAQ file.
• During the merge, rows whose CDGLOBAL is -1 and rows whose VISCODE is f, nv, or uns1 are trimmed off.
• The result file is "Merged_dataset_file.csv".
Query used for generating the merged file:

    SELECT f.cID, f.RID, f.VISCODE, f.EXAMDATE, f.FAQSOURCE, f.FAQFINAN, f.FAQFORM,
           f.FAQSHOP, f.FAQGAME, f.FAQBEVG, f.FAQMEAL, f.FAQEVENT, f.FAQTV, f.FAQREM,
           f.FAQTRAVL, f.FAQTOTAL, cdr.cdglobal
    FROM cdr, faq f
    WHERE cdr.rid = f.rid
      AND cdr.VISCODE = f.VISCODE
      AND cdr.cdglobal NOT IN (-1);
Generation of files for future status prediction
• The prediction dataset is generated by mapping the first visit to the 6-month class, the 6-month visit to the 12-month class, and so on.
• SQL query operations are performed on the merged file to separate the classes at 6-month intervals.
• The following files are generated:
  – File_dataset_m06.csv
  – File_dataset_m12.csv, and so on
Query used for generating the class files:

    SELECT v.ID AS ID, v.RID AS RID, v.VISCODE, v.EXAMDATE, v.FAQSOURCE, v.FAQFINAN,
           v.FAQFORM, v.FAQSHOP, v.FAQGAME, v.FAQBEVG, v.FAQMEAL, v.FAQEVENT, v.FAQTV,
           v.FAQREM, v.FAQTRAVL, v.FAQTOTAL, m12.cdrglobal
    FROM `table_adni_faq_2011-01-20_m06` v, `table_adni_faq_2011-01-20_m12` m12
    WHERE v.rid = m12.rid;
Preprocessing
• After we get the required .csv files, we use WEKA to preprocess the data.
• Load the file into WEKA.
• Apply the filter "weka.filters.unsupervised.attribute.Remove" to trim off the unused fields.
• Apply "NumericToNominal" to convert all values in the data to nominal before feeding the data to a classifier algorithm (a sketch of both filter steps follows below).
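As a concrete illustration, here is a minimal sketch of the same two preprocessing steps using the WEKA Java API instead of the GUI. The attribute indices "1-3" are only placeholders for the unused fields; the actual columns to remove depend on the merged file.

    // Minimal WEKA preprocessing sketch: remove unused attributes, then
    // convert the remaining numeric attributes (including CDGLOBAL) to nominal.
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.NumericToNominal;
    import weka.filters.unsupervised.attribute.Remove;

    public class Preprocess {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("Merged_dataset_file.csv");

            // Remove unused fields (e.g. IDs and dates) by attribute index
            Remove remove = new Remove();
            remove.setAttributeIndices("1-3");          // placeholder indices
            remove.setInputFormat(data);
            data = Filter.useFilter(data, remove);

            // Convert numeric attributes to nominal before classification
            NumericToNominal toNominal = new NumericToNominal();
            toNominal.setAttributeIndices("first-last");
            toNominal.setInputFormat(data);
            data = Filter.useFilter(data, toNominal);

            // The class attribute is the last column (CDGLOBAL)
            data.setClassIndex(data.numAttributes() - 1);
        }
    }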
Classification Algorithms Used
• The Classify panel enables the user to apply classification and regression algorithms (indiscriminately called classifiers in WEKA) to the resulting dataset and to estimate the accuracy of the resulting predictive model.
• J48, which uses the C4.5 algorithm (a successor of ID3)
• The Naïve Bayes classification algorithm
What is classification?
• Given a collection of records (the training set):
  – Each record contains a set of attributes; one of the attributes is the class.
  – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
• Example: if the items in a house are not classified, we cannot arrange them. We classify the items by their usage (cooking items, decoration items, etc.) so that we can arrange them accordingly and use them in an efficient and easier way.
Decision Tree Classification Task (figure): the example decision tree first tests Refund (Yes/No), then MarSt (Married vs. Single/Divorced), then TaxInc (< 80K vs. > 80K); the test record is routed down the tree and Cheat is assigned "No".
J48 uses the C4.5 Algorithm
• Decision trees represent a supervised approach to classification.
• Decision trees are a classic way to represent information from a machine learning algorithm, and offer a fast and powerful way to express structures in data.
• A decision tree is a simple structure where non-terminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes.
• The basic algorithm recursively classifies until each leaf is pure, meaning that the data has been categorized as close to perfectly as possible. This process ensures maximum accuracy on the training data.
• The latest public-domain implementation of Quinlan's model is C4.5. The WEKA classifier package has its own version of C4.5 known as J48 (a short sketch of using it follows below).
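A minimal sketch of building a J48 tree on the preprocessed data with the WEKA Java API. The file name "preprocessed.arff" is only a placeholder for the data exported from the Preprocess panel.

    // Build a J48 (C4.5) decision tree and print it.
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class BuildJ48 {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("preprocessed.arff");
            data.setClassIndex(data.numAttributes() - 1);   // CDGLOBAL as the class

            J48 tree = new J48();
            tree.setConfidenceFactor(0.25f);   // C4.5 pruning confidence (WEKA default)
            tree.buildClassifier(data);

            System.out.println(tree);          // prints the pruned decision tree
        }
    }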
Why the decision tree algorithm?
• Advantages:
  – Inexpensive to construct
  – Easy to interpret for small-sized trees
  – Accuracy is comparable to other classification techniques for many simple data sets
  – More than one tree may be possible for the same data
• Disadvantages:
  – Underfitting: when the model is too simple, both training and test errors are large
All about Cross-Validation
• We perform cross-validation when the amount of data is small and we need independent training and test sets from it.
• It is important that each class is represented in its actual proportions in the training and test sets: stratification.
• An important cross-validation technique is stratified 10-fold cross-validation, where the instance set is divided into 10 folds.
• We run 10 iterations, each time taking a different single fold for testing and the rest for training (see the sketch below).
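A minimal sketch of stratified 10-fold cross-validation with WEKA's Evaluation class (crossValidateModel stratifies the folds internally); the file name is again a placeholder.

    // 10-fold cross-validation of J48 on the preprocessed data.
    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CrossValidate {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("preprocessed.arff");
            data.setClassIndex(data.numAttributes() - 1);

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, 10, new Random(1));

            System.out.println(eval.toSummaryString());
        }
    }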
Evaluation
• Metrics for performance evaluation – how do we evaluate the performance of a model?
• Methods for model comparison – how do we compare the relative performance of competing models?
Metrics for Performance Evaluation: Confusion Matrix
• A confusion matrix contains information about actual and predicted classifications done by a classification system. Performance of such systems is commonly evaluated using the data in the matrix. For a two-class classifier, the matrix holds the counts of true positives, false positives, true negatives, and false negatives.
• We get the confusion matrix after supplying data to a classifier.
• Based on the confusion matrix we can compute measures such as precision, recall, F-measure, and accuracy (see the snippet below).
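Continuing the cross-validation sketch above (these lines could be appended inside the same main method, using the same eval object), WEKA can print and expose the confusion matrix directly:

    // Print WEKA's confusion matrix; rows are actual classes, columns are predicted classes.
    System.out.println(eval.toMatrixString("=== Confusion Matrix ==="));

    double[][] cm = eval.confusionMatrix();   // cm[actual][predicted]
    System.out.println("Correctly classified as class 0: " + cm[0][0]);
    System.out.println("Class 0 misclassified as class 1: " + cm[0][1]);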
Example
• Suppose there is a sample of 27 animals: 8 cats, 6 dogs, and 13 rabbits.
• Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class.
• We can see from the matrix that the system in question has trouble distinguishing between cats and dogs, but can make the distinction between rabbits and the other types of animals pretty well.
• All correct guesses are located on the diagonal of the table, so it is easy to visually inspect the table for errors: they are represented by any non-zero values outside the diagonal.
Limitation of Accuracy
• Consider a 2-class problem:
  – Number of Class 0 examples = 9990
  – Number of Class 1 examples = 10
• If the model predicts everything to be Class 0, accuracy is 9990/10000 = 99.9%.
  – Here accuracy is misleading because the model does not detect a single Class 1 example.
  – Accuracy has further disadvantages as a performance estimate. For example, if there were 95 cats and only 5 dogs in the data set, the classifier could easily be biased into classifying all the samples as cats. The overall accuracy would be 95%, but in practice the classifier would have a 100% recognition rate for the cat class and a 0% recognition rate for the dog class, so you will probably want to look at some of the other measures. ROC area (the area under the ROC curve) is also taken as a preferred measure.
Metrics for Evaluation
• Accuracy: the accuracy (AC) is the proportion of the total number of predictions that were correct, i.e., what percentage of people were correctly classified. It is determined using the equation:
  Accuracy = (# True Positives + # True Negatives) / N, where N = total # of predictions
• Precision: precision (P) is the proportion of the predicted positive cases that were correct. Of all the people classified as demented, what percentage is actually demented? It is calculated using the equation:
  Precision = (# True Positives) / (# True Positives + # False Positives)
Equivalently, Accuracy = (TP + TN) / (TP + TN + FP + FN).
Evaluation
• F-measure: the F-measure is the harmonic mean of precision and recall:
  F-measure = 2 * (# True Positives) / (2 * # True Positives + # False Positives + # False Negatives)
• Recall: recall is the ratio of the number of true positives to the sum of true positives and false negatives. It is calculated using the equation:
  Recall = (# True Positives) / (# True Positives + # False Negatives)
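These measures do not have to be computed by hand; continuing the same Evaluation object from the cross-validation sketch, WEKA reports them per class (class index 0 is used here only as an example; in this project each CDGLOBAL value is one class):

    // Per-class precision, recall, F-measure, and overall accuracy from WEKA.
    System.out.println("Precision: " + eval.precision(0));
    System.out.println("Recall:    " + eval.recall(0));
    System.out.println("F-measure: " + eval.fMeasure(0));
    System.out.println("Accuracy:  " + eval.pctCorrect() + " %");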
Methods for Model Comparison: ROC (Receiver Operating Characteristic)
• Developed in the 1950s for signal detection theory to analyze noisy signals; it characterizes the trade-off between positive hits and false alarms.
• The ROC curve plots the true positive rate (on the y-axis) against the false positive rate (on the x-axis).
Using ROC for Model Comparison (figure): M1 is better for small FPR; M2 is better for large FPR.
• Area Under the ROC Curve (AUC): a rough guide for classifying the accuracy of a diagnostic test is the traditional academic point system:
  – 0.90–1.00 = excellent (A)
  – 0.80–0.90 = good (B)
  – 0.70–0.80 = fair (C)
  – 0.60–0.70 = poor (D)
  – 0.50–0.60 = fail (F)
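The area under the ROC curve is also available from the same Evaluation object (again per class; class index 0 is just an example):

    // ROC area for one class, from the Evaluation object used above.
    System.out.println("ROC area (class 0): " + eval.areaUnderROC(0));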
Naïve Bayes
• A simple probabilistic classifier based on applying Bayes' theorem with independence assumptions. A naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature.
• For example, a fruit may be considered to be an apple if it is red, round, and about 4 inches in diameter. Even if these features depend on each other or on the existence of the other features, a naive Bayes classifier considers all of these properties to contribute independently to the probability that this fruit is an apple.
• An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because the variables are assumed independent, only the variances of the variables for each class need to be determined, not the entire covariance matrix. It is best suited for attributes that are independent, and it is very simple and very fast.
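A minimal sketch of running WEKA's Naïve Bayes classifier on the same preprocessed data, evaluated with the same 10-fold cross-validation as the J48 example ("preprocessed.arff" is again a placeholder file name):

    // Swap in Naive Bayes; preprocessing and evaluation stay the same as for J48.
    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RunNaiveBayes {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("preprocessed.arff");
            data.setClassIndex(data.numAttributes() - 1);

            NaiveBayes nb = new NaiveBayes();
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(nb, data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }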
Challenges Faced
• Initially all the data files were processed using JDBC and MySQL; this later proved cumbersome whenever another dataset was used. Hence a PHP-based MySQL approach was adopted, which generalizes to all datasets.
• Table creation was initially done by hand for loading the data; later it was done with file-operation functions.
• All the MySQL commands were initially run sequentially; this was later improved by using PHP as the front end.
• Initially the J48 tree could not be built because the data was numeric. This was resolved by discretization (NumericToNominal) of the CDGLOBAL column.
Result slides (output screenshots)
• Preprocess output
• Result file for current status (J48)
• Current status (Naïve Bayes)
• Future status (J48)
• Future status (Naïve Bayes)
• MMSE (J48)
References:
http://kent.dl.sourceforge.net/project/weka/documentation/3.6.x/WekaManual-3-6-2.pdf
http://www.dfki.de/~kipp/seminar_ws0607/reports/RossenDimov.pdf
http://stackoverflow.com/questions/2903933/how-to-interpret-weka-classification
http://www.slideshare.net/dataminingtools/weka-credibility-evaluating-whats-been-learned
Thank you