Evaluating Methods for the Identification of Cancer in
Free-Text Pathology Reports Using Alternative
Machine Learning and Data Preprocessing Approaches
Suranga Nath Kasthurirathne
What does that even mean?
Our problem
• Cancer case reporting to public health registries is often:
– Delayed
– Incomplete
Our emphasis
• Use pathology reports
• Automate it (it actually works!)
Our solution
• Speed
• Accuracy
• Applicability to other surveillance activities
• Computationally efficient
Issues
• Lots of data
• Lots of FREE-TEXT data
• Not enough time
• Not enough resources
Clarifications
When I say “We”:
• “We” in terms of decision making and consultation usually means Dr. Grannis
• “We” in terms of implementation and code mongering usually means Suranga
Our basic approach
Solution/s
What improvements are we trying out?
• Alternative data input formats
• Candidate decision models
• Decision model combinations
• HOW to look for vs. WHAT to look for
Manual review
• Functions as our source of truth
– What?
– Why?
• Manually reviewed 1,495 reports
• Identified 371 (24.8%) positive cancer cases
Machine learning process
• Identification of keywords
– What ARE keywords?
– Metastasis, tumor, malignant, neoplasm, stage, carcinoma, and ca
• Identification of negation context
• Use of alternate data input formats
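The keyword-plus-negation step above can be sketched in a few lines. This is a minimal illustration, not our production code: the keyword list is the one from this slide, but the negation cues and the 3-token look-back window are assumptions in the spirit of NegEx.

```python
import re

# Keywords from the slide above; the negation cue list is an assumption,
# a small NegEx-style trigger set, not necessarily the exact set we used.
KEYWORDS = {"metastasis", "tumor", "malignant", "neoplasm",
            "stage", "carcinoma", "ca"}
NEGATION_CUES = re.compile(r"\b(no|not|without|negative for|free of)\b",
                           re.IGNORECASE)

def keyword_hits(report: str) -> dict:
    """Find keyword occurrences, flagging those in a negated context.

    A keyword counts as negated if a negation cue appears in the three
    tokens preceding it (a crude window-based heuristic).
    """
    tokens = re.findall(r"[a-z]+", report.lower())
    hits = {}
    for i, tok in enumerate(tokens):
        if tok in KEYWORDS:
            window = " ".join(tokens[max(0, i - 3):i])
            negated = bool(NEGATION_CUES.search(window))
            hits.setdefault(tok, []).append(negated)
    return hits
```

For example, in "Specimen is negative for malignant cells; invasive carcinoma identified.", "malignant" is flagged as negated while "carcinoma" is not.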
What were the different data input formats used?
• Raw data input• Four state data input
What and Why ?
• Raw
• Four state
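One plausible reading of the two input formats, sketched below. The raw format is per-keyword occurrence counts; the four-state coding shown here (absent / affirmed / negated / mixed) is an assumption for illustration, not necessarily our exact state definitions.

```python
def raw_counts(hits_per_keyword: dict, keywords: set) -> list:
    """Raw input: one occurrence count per keyword, in a fixed order."""
    return [len(hits_per_keyword.get(k, [])) for k in sorted(keywords)]

def four_state(hits_per_keyword: dict, keywords: set) -> list:
    """Hypothetical four-state coding (an assumption, not the exact scheme):
    0 = absent, 1 = present (affirmed), 2 = present (negated), 3 = mixed.
    `hits_per_keyword` maps keyword -> list of negated? flags per occurrence.
    """
    states = []
    for k in sorted(keywords):
        flags = hits_per_keyword.get(k, [])
        if not flags:
            states.append(0)
        elif not any(flags):
            states.append(1)
        elif all(flags):
            states.append(2)
        else:
            states.append(3)
    return states
```

Either function turns one report into a fixed-length numeric vector, which is what the downstream decision models consume.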
So basically
Training / Testing
• What?
• Why cross-validation?
• Alternative decision models
– So many options!
– Classification vs. clustering analysis
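Cross-validation just means every report gets a turn in the test set. A stdlib-only sketch of the fold split (10 folds is an assumption; the seed is arbitrary):

```python
import random

def k_fold_indices(n: int, k: int = 10, seed: int = 42) -> list:
    """Shuffle n sample indices and deal them into k folds.

    Each fold serves once as the held-out test set while the remaining
    k-1 folds train the model; results are averaged over the k rounds.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]
```

This is why cross-validation matters with only 1,495 labeled reports: every label contributes to both training and evaluation, instead of sacrificing a large fixed test split.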
To preserve my sanity, and because we’re not stupid…
• We used Weka (Waikato Environment for Knowledge Analysis)
– A collection of machine learning algorithms for data mining tasks
– Open source!
Decision models used
• Logistic regression
• Naïve Bayes
• Support vector machine
• K-nearest neighbor
• Random forest
• J48 decision tree
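The actual models came from Weka, but the flavor of the simplest one on the list, k-nearest neighbor, fits in a few stdlib lines (Euclidean distance and k=3 are illustrative choices, not our tuned settings):

```python
from collections import Counter

def knn_predict(train_X: list, train_y: list, x: list, k: int = 3):
    """Classify x by majority vote among its k nearest training vectors.

    Distance is squared Euclidean; ties in the vote go to the label
    seen first, which is good enough for a sketch.
    """
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]
```

Each report's keyword vector (raw counts or four-state) is one row of `train_X`; the manual-review cancer/no-cancer label is its entry in `train_y`.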
(Thanks, Jamie!!!)
Results
• How do we measure our results?
– Precision
• What % of positive predictions were correct?
– Recall
• What % of positive cases were caught?
– Accuracy
• What % of predictions were correct?
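The three metrics above, written out from the standard confusion-matrix counts (tp/fp/fn/tn), exactly as defined on this slide:

```python
def metrics(tp: int, fp: int, fn: int, tn: int) -> tuple:
    """Precision, recall, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)           # % of positive predictions correct
    recall = tp / (tp + fn)              # % of true positive cases caught
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # % of all predictions correct
    return precision, recall, accuracy
```

A toy confusion matrix of 30 true positives, 10 false positives, 20 false negatives, and 40 true negatives gives precision 0.75, recall 0.60, accuracy 0.70, which shows why the three can disagree.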
Precision vs. recall: the fine balance
Results (contd.)
• RF and NB showed statistically significantly lower precision
• SVM exhibited statistically significantly lower recall
• SVM and NB produced statistically significantly lower accuracy
Overall performance by preprocessed input type
• Raw count input performed significantly better than four-state input
Overall performance by decision model
• The ensemble approach performed significantly better than the individual algorithms
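The simplest way an ensemble can combine the candidate models is a majority vote over their per-report predictions, sketched below (the actual combination scheme may differ; this is an illustrative assumption):

```python
from collections import Counter

def majority_vote(per_model_predictions: list) -> list:
    """Combine model outputs by majority vote.

    `per_model_predictions` is a list of prediction lists, one list per
    model, each with one 0/1 label per report. Returns the per-report
    majority label.
    """
    combined = []
    for votes in zip(*per_model_predictions):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined
```

The intuition for the result on this slide: individual models make different mistakes, so the vote cancels some of each model's errors.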
Improvements
Keywords? Sure, I have a list…
Better identification of keywords
Shaun
Problems with NegEx…
Results
• The funder is happy… we think
• We wrote an abstract!
• Feature selection approaches for keyword identification as an independent study rotation
Our thanks to…
• Dr. Shaun Grannis (RI)
• Dr. Brian Dixon (RI)
• Dr. Judy Wawira (IUPUI)
• Eric Durbin (UKC)
Questions?