Evaluating Methods for the Identification of Cancer in
Free-Text Pathology Reports Using Alternative
Machine Learning and Data Preprocessing Approaches
Suranga Nath Kasthurirathne
What does that even mean?
Our problem
• Cancer case reporting to public health registries is often:
– Delayed
– Incomplete
Our emphasis
• Use pathology reports
• Automate it (it actually works!)
Our solution
• Speed
• Accuracy
• Applicability to other surveillance activities
• Computationally efficient
Issues
• Lots of data
• Lots of FREE-TEXT data
• Not enough time
• Not enough resources
Clarifications
When I say “We”:
• “We” in terms of decision making and consultation usually means Dr. Grannis
• “We” in terms of implementation and code mongering usually means Suranga
Our basic approach
Solution/s
What improvements are we trying out?
• Alternative data input formats
• Candidate decision models
• Decision model combinations
• HOW to look for vs. WHAT to look for
Manual review
• Functions as our source of truth
– What?
– Why?
• Manually reviewed 1,495 reports
• Identified 371 (24.8%) positive cancer cases
Machine learning process
• Identification of keywords
– What ARE keywords?
– Metastasis, tumor, malignant, neoplasm, stage, carcinoma, and ca
• Identification of negation context
• Use of alternate data input formats
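The keyword-plus-negation step above can be sketched in a few lines. This is a minimal illustration, not our production code: the keyword list is the one from this slide, but the negation cues and the 3-token look-back window are assumptions in the spirit of NegEx.

```python
import re

# Keywords from the slide above; the negation cue list is an assumption,
# a small NegEx-style trigger set, not necessarily the exact set we used.
KEYWORDS = {"metastasis", "tumor", "malignant", "neoplasm",
            "stage", "carcinoma", "ca"}
NEGATION_CUES = re.compile(r"\b(no|not|without|negative for|free of)\b",
                           re.IGNORECASE)

def keyword_hits(report: str) -> dict:
    """Find keyword occurrences, flagging those in a negated context.

    A keyword counts as negated if a negation cue appears in the three
    tokens preceding it (a crude window-based heuristic).
    """
    tokens = re.findall(r"[a-z]+", report.lower())
    hits = {}
    for i, tok in enumerate(tokens):
        if tok in KEYWORDS:
            window = " ".join(tokens[max(0, i - 3):i])
            negated = bool(NEGATION_CUES.search(window))
            hits.setdefault(tok, []).append(negated)
    return hits
```

For example, in "Specimen is negative for malignant cells; invasive carcinoma identified.", "malignant" is flagged as negated while "carcinoma" is not.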
What were the different data input formats used?
• Raw data input• Four state data input
What and Why ?
• Raw
• Four state
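One plausible reading of the two input formats, sketched below. The raw format is per-keyword occurrence counts; the four-state coding shown here (absent / affirmed / negated / mixed) is an assumption for illustration, not necessarily our exact state definitions.

```python
def raw_counts(hits_per_keyword: dict, keywords: set) -> list:
    """Raw input: one occurrence count per keyword, in a fixed order."""
    return [len(hits_per_keyword.get(k, [])) for k in sorted(keywords)]

def four_state(hits_per_keyword: dict, keywords: set) -> list:
    """Hypothetical four-state coding (an assumption, not the exact scheme):
    0 = absent, 1 = present (affirmed), 2 = present (negated), 3 = mixed.
    `hits_per_keyword` maps keyword -> list of negated? flags per occurrence.
    """
    states = []
    for k in sorted(keywords):
        flags = hits_per_keyword.get(k, [])
        if not flags:
            states.append(0)
        elif not any(flags):
            states.append(1)
        elif all(flags):
            states.append(2)
        else:
            states.append(3)
    return states
```

Either function turns one report into a fixed-length numeric vector, which is what the downstream decision models consume.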
So basically
Training / Testing
• What?
• Why cross-validation?
• Alternative decision models
– So many options!
– Classification vs. clustering analysis
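Cross-validation just means every report gets a turn in the test set. A stdlib-only sketch of the fold split (10 folds is an assumption; the seed is arbitrary):

```python
import random

def k_fold_indices(n: int, k: int = 10, seed: int = 42) -> list:
    """Shuffle n sample indices and deal them into k folds.

    Each fold serves once as the held-out test set while the remaining
    k-1 folds train the model; results are averaged over the k rounds.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]
```

This is why cross-validation matters with only 1,495 labeled reports: every label contributes to both training and evaluation, instead of sacrificing a large fixed test split.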
To preserve my sanity, and because we’re not stupid…
• We used Weka (Waikato Environment for Knowledge Analysis)
– A collection of machine learning algorithms for data mining tasks
– Open source!
Decision models used
• Logistic regression
• Naïve Bayes
• Support vector machine
• K-nearest neighbor
• Random forest
• J48 decision tree
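The actual models came from Weka, but the flavor of the simplest one on the list, k-nearest neighbor, fits in a few stdlib lines (Euclidean distance and k=3 are illustrative choices, not our tuned settings):

```python
from collections import Counter

def knn_predict(train_X: list, train_y: list, x: list, k: int = 3):
    """Classify x by majority vote among its k nearest training vectors.

    Distance is squared Euclidean; ties in the vote go to the label
    seen first, which is good enough for a sketch.
    """
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]
```

Each report's keyword vector (raw counts or four-state) is one row of `train_X`; the manual-review cancer/no-cancer label is its entry in `train_y`.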
(Thanks, Jamie!!!)
Results
• How do we measure our results?
– Precision
• What % of positive predictions were correct?
– Recall
• What % of positive cases were caught?
– Accuracy
• What % of predictions were correct?
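The three metrics above, written out from the standard confusion-matrix counts (tp/fp/fn/tn), exactly as defined on this slide:

```python
def metrics(tp: int, fp: int, fn: int, tn: int) -> tuple:
    """Precision, recall, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)           # % of positive predictions correct
    recall = tp / (tp + fn)              # % of true positive cases caught
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # % of all predictions correct
    return precision, recall, accuracy
```

A toy confusion matrix of 30 true positives, 10 false positives, 20 false negatives, and 40 true negatives gives precision 0.75, recall 0.60, accuracy 0.70, which shows why the three can disagree.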
Precision vs. recall: the fine balance
Results (contd.)
• RF and NB showed statistically significantly lower precision
• SVM exhibited statistically significantly lower recall
• SVM and NB produced statistically significantly lower accuracy
Overall performance by preprocessed input type
• Raw count input performed significantly better than four-state input
Overall performance by decision model
• The ensemble approach performed significantly better than the individual algorithms
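The simplest way an ensemble can combine the candidate models is a majority vote over their per-report predictions, sketched below (the actual combination scheme may differ; this is an illustrative assumption):

```python
from collections import Counter

def majority_vote(per_model_predictions: list) -> list:
    """Combine model outputs by majority vote.

    `per_model_predictions` is a list of prediction lists, one list per
    model, each with one 0/1 label per report. Returns the per-report
    majority label.
    """
    combined = []
    for votes in zip(*per_model_predictions):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined
```

The intuition for the result on this slide: individual models make different mistakes, so the vote cancels some of each model's errors.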
Improvements
Keywords? Sure, I have a list…
Better identification of keywords
Shaun
Problems with NegEx…
Results
• The funder is happy… we think
• We wrote an abstract!
• Feature selection approaches for keyword identification as an independent study rotation
Our thanks to…
• Dr. Shaun Grannis (RI)
• Dr. Brian Dixon (RI)
• Dr. Judy Wawira (IUPUI)
• Eric Durbin (UKC)
Questions?