Evaluating Methods for the Identification of Cancer in Free-Text Pathology Reports Using Alternative Machine Learning and Data Preprocessing Approaches, by Suranga Nath Kasthurirathne

Sk ghi (wip) 22052014

DESCRIPTION

Presentation on Evaluating Methods for the Identification of Cancer in Free-Text Pathology Reports Using Alternative Machine Learning and Data Preprocessing Approaches

Page 1: Sk ghi (wip) 22052014

Evaluating Methods for the Identification of Cancer in Free-Text Pathology Reports Using Alternative Machine Learning and Data Preprocessing Approaches

Suranga Nath Kasthurirathne

Page 2: Sk ghi (wip) 22052014

What does that even mean?

Page 3: Sk ghi (wip) 22052014

Our problem

• Cancer case reporting to public health registries is often:
  – Delayed
  – Incomplete

Page 4: Sk ghi (wip) 22052014

Our emphasis

• Use pathology reports
• Automate it (It actually works!)

Our solution

• Speed
• Accuracy
• Applicability to other surveillance activities
• Computationally efficient

Page 5: Sk ghi (wip) 22052014

Issues

• Lots of data

• Lots of FREE-TEXT data
• Not enough time
• Not enough resources

Page 6: Sk ghi (wip) 22052014

Clarifications

When I say “We”:

• “We” in terms of decision making and consultation usually means Dr. Grannis

• “We” in terms of implementation and code mongering usually means Suranga

Page 7: Sk ghi (wip) 22052014

Our basic approach

Page 8: Sk ghi (wip) 22052014

Solution/s

What improvements are we trying out?

• Alternative data input formats
• Candidate decision models
• Decision model combinations
• HOW to look vs. WHAT to look for

Page 9: Sk ghi (wip) 22052014

Manual review

• Functions as our source of truth
  – What?
  – Why?

Manually reviewed 1495 reports
Identified 371 (24.8%) positive cancer cases

Page 10: Sk ghi (wip) 22052014

Machine learning process

• Identification of keywords
  – What ARE keywords?
    Metastasis, tumor, malignant, neoplasm, stage, carcinoma and ca
• Identification of negation context
• Use of alternate data input formats
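The keyword and negation steps on this slide can be illustrated with a minimal sketch. It uses the keyword list from the slide, but everything else is an assumption made for illustration: the tiny `NEGATION_TRIGGERS` list, the `WINDOW` of five tokens, and the function names are hypothetical and are not the project's actual implementation or the full NegEx rule set.

```python
import re

# Keyword list as given on the slide.
KEYWORDS = ["metastasis", "tumor", "malignant", "neoplasm", "stage", "carcinoma", "ca"]

# ASSUMPTION: a deliberately small, hypothetical trigger list (not NegEx's full list).
NEGATION_TRIGGERS = ["no", "not", "without", "negative for", "no evidence of"]

# ASSUMPTION: a trigger negates keywords found in the next WINDOW tokens.
WINDOW = 5


def negated_positions(tokens):
    """Return indices of tokens that fall in the window after a negation trigger."""
    negated = set()
    for trigger in NEGATION_TRIGGERS:
        words = trigger.split()
        n = len(words)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == words:
                negated.update(range(i + n, min(len(tokens), i + n + WINDOW)))
    return negated


def count_keywords(report):
    """Count affirmed and negated mentions of each keyword in one report."""
    counts = {k: {"affirmed": 0, "negated": 0} for k in KEYWORDS}
    # Treat each sentence separately so a negation does not leak across sentences.
    for sentence in re.split(r"[.;\n]+", report.lower()):
        tokens = re.findall(r"[a-z]+", sentence)
        negated = negated_positions(tokens)
        for i, token in enumerate(tokens):
            if token in counts:
                counts[token]["negated" if i in negated else "affirmed"] += 1
    return counts


report = "Invasive carcinoma identified. No evidence of metastasis. Tumor is malignant."
counts = count_keywords(report)
```

In this example, `carcinoma`, `tumor` and `malignant` are counted as affirmed, while `metastasis` is counted as negated because it follows "no evidence of" in the same sentence.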

Page 11: Sk ghi (wip) 22052014

What were the different data input formats used?

• Raw data input
• Four state data input

What and why?

Page 12: Sk ghi (wip) 22052014

• Raw

• Four state
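The slide text does not spell out what the two formats contain, so the following is only one plausible reading, sketched for illustration. It assumes that "raw" means passing the per-keyword counts through as numbers, and that "four state" collapses each keyword into one of four categories (absent, affirmed only, negated only, both). The state names and both function names are hypothetical.

```python
def raw_count_features(counts):
    """ASSUMED raw format: keep affirmed and negated counts as plain numbers."""
    features = {}
    for keyword, c in counts.items():
        features[keyword + "_affirmed"] = c["affirmed"]
        features[keyword + "_negated"] = c["negated"]
    return features


def four_state_features(counts):
    """ASSUMED four-state format: one categorical value per keyword."""
    features = {}
    for keyword, c in counts.items():
        if c["affirmed"] and c["negated"]:
            features[keyword] = "both"
        elif c["affirmed"]:
            features[keyword] = "affirmed_only"
        elif c["negated"]:
            features[keyword] = "negated_only"
        else:
            features[keyword] = "absent"
    return features


# Hypothetical per-keyword counts for a single report.
example = {
    "tumor": {"affirmed": 2, "negated": 0},
    "metastasis": {"affirmed": 0, "negated": 1},
    "stage": {"affirmed": 0, "negated": 0},
}
raw = raw_count_features(example)
states = four_state_features(example)
```

Under this reading, the raw format keeps how often a keyword appears, while the four-state format keeps only whether and in what context it appears.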

Page 13: Sk ghi (wip) 22052014

So basically

Page 14: Sk ghi (wip) 22052014

Training / Testing

• What?
• Why cross validation?
• Alternative decision models
  – So many options!
  – Classification vs. clustering analysis
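As a rough illustration of cross validation, the sketch below splits the 1495 reviewed reports into folds so that every report is used for testing exactly once. The choice of 10 folds, the fixed seed, and the function name `k_fold_indices` are assumptions; the slides do not state the fold count, and Weka provides its own cross-validation facility.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 1495 is the number of manually reviewed reports; k=10 is an assumption.
splits = list(k_fold_indices(1495, k=10))
```

Each model is trained on the `train` indices and scored on the `test` indices of every split, and the scores are averaged, so no report is ever used to evaluate a model that was trained on it.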

Page 15: Sk ghi (wip) 22052014

To preserve my sanity, and because we’re not stupid…

• We used Weka (Waikato Environment for Knowledge Analysis)
  – A collection of machine learning algorithms for data mining tasks
  – Open source!

Page 16: Sk ghi (wip) 22052014

Decision models used

• Logistic regression
• Naïve Bayes
• Support vector machine
• K-nearest neighbor
• Random forest
• J48 decision tree

(Thanks, Jamie!!!)

Page 17: Sk ghi (wip) 22052014
Page 18: Sk ghi (wip) 22052014

Results

• How do we measure our results?
  – Precision
    • What % of positive predictions were correct?
  – Recall
    • What % of positive cases were caught?
  – Accuracy
    • What % of predictions were correct?

Precision vs. recall: the fine balance
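The three measures can be computed directly from the manual review labels and a model's predictions. This is a minimal sketch; the function name `evaluate` and the toy labels are made up for illustration and are not the study's data.

```python
def evaluate(actual, predicted):
    """Return (precision, recall, accuracy) for boolean labels and predictions."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # cancer, flagged
    fp = sum(1 for a, p in pairs if not a and p)      # not cancer, flagged
    fn = sum(1 for a, p in pairs if a and not p)      # cancer, missed
    tn = sum(1 for a, p in pairs if not a and not p)  # not cancer, not flagged
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return precision, recall, accuracy


# Toy example: 3 true cancer cases among 8 reports.
actual = [True, True, True, False, False, False, False, False]
predicted = [True, True, False, True, False, False, False, False]
precision, recall, accuracy = evaluate(actual, predicted)
```

Here two of the three flagged reports are real cases (precision 2/3), two of the three real cases are caught (recall 2/3), and six of the eight predictions are right (accuracy 0.75). Flagging more reports tends to raise recall at the cost of precision, which is the balance the slide refers to.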

Page 19: Sk ghi (wip) 22052014

Results (contd.)

• RF and NB showed statistically significantly lower values for precision

• SVM exhibited statistically significantly lower results for recall

• SVM and NB produced statistically significantly lower results for accuracy

Page 20: Sk ghi (wip) 22052014

Overall performance by preprocessed input type

• Raw count is significantly better than four state

Page 21: Sk ghi (wip) 22052014

Overall performance by decision model

• Ensemble approach is significantly better than individual algorithms
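The slides do not say how the ensemble combines its member models. One common scheme is a simple majority vote, sketched below purely as an assumption; `majority_vote` and the toy predictions are hypothetical.

```python
def majority_vote(model_predictions):
    """Combine per-model boolean predictions; a report is positive if most models say so.

    model_predictions: one list of predictions per model, all the same length.
    Ties (possible only with an even number of models) count as negative.
    """
    combined = []
    for votes in zip(*model_predictions):
        combined.append(sum(votes) * 2 > len(votes))
    return combined


# Toy predictions from three hypothetical models for three reports.
model_predictions = [
    [True, False, True],
    [True, True, False],
    [False, False, True],
]
ensemble = majority_vote(model_predictions)
```

With an odd number of models there are no ties, and a single model's mistake on a report is outvoted when the other models get it right.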

Page 22: Sk ghi (wip) 22052014

Improvements

Page 23: Sk ghi (wip) 22052014

Keywords? Sure, I have a list…

Better identification of keywords

Shaun

Page 24: Sk ghi (wip) 22052014

Problems with NegEx…

Page 25: Sk ghi (wip) 22052014

Results

• The funder is happy… we think
• We wrote an abstract!
• Feature selection approaches for keyword identification as an independent study rotation

Page 26: Sk ghi (wip) 22052014

Our thanks to…

• Dr. Shaun Grannis (RI)
• Dr. Brian Dixon (RI)
• Dr. Judy Wawira (IUPUI)
• Eric Durbin (UKC)

Page 27: Sk ghi (wip) 22052014

Questions?