Use of SAS-Based Natural Language Processing to Identify Incident and Recurrent Malignancies Justin A. Strauss, MA Research Associate III Kaiser Permanente Southern California May 1, 2012 • 2012 HMORN Conference • Seattle, Washington


Use of SAS-Based Natural Language Processing to Identify Incident and Recurrent Malignancies

Justin A. Strauss, MA

Research Associate III

Kaiser Permanente Southern California

May 1, 2012 • 2012 HMORN Conference • Seattle, Washington


Co-Authors & Funding

• Chun R. Chao, PhD

• Marilyn L. Kwan, PhD

• Syed A. Ahmed, MD

• Joanne E. Schottinger, MD

• Virginia P. Quinn, PhD


Acknowledgements & Funding

• Mayra Martinez, Michelle McGuire, Melissa Preciado, Nirupa Ghai, and Jeff Slezak (KPSC); Lawrence Kushi (KPNC); Debra Ritzwoller (KPCO); Joan Warren (NCI); Jianyu Rao and Jiaoti Huang (UCLA)

• Funding was provided by KPSC Community Benefit and the Cancer Research Network


Malignancy Identification

• Malignancy identification is important for clinical and epidemiologic cancer research.

• Limited quality and availability of incident and recurrent malignancy data within health plans.

• Delayed availability of incident malignancy data from cancer registries.

• Few registries track cancer recurrences.

• Manual chart abstraction slow and expensive.

• Previous research has shown electronic diagnosis codes (e.g., ICD-9) to be unreliable.


Natural Language Processing

• Natural language processing (NLP) can be used to identify and extract information from electronic clinical text, including incident and recurrent malignancy data.

• Increasing opportunity for NLP with adoption of electronic clinical systems in patient care delivery.

• Despite its potential value in clinical and research settings, NLP usage has been relatively sparse. Contributing factors may include:

• Technical complexity

• Systems integration requirements

• Habitual use of existing methods


SCENT Overview

• A SAS-based coding, extraction, and nomenclature tool (SCENT) was developed to identify incident and recurrent malignancies using text from pathology reports.

• SCENT is currently being implemented in two research studies at Kaiser Permanente Southern California (KPSC):

• Intervention to improve medication adherence among breast cancer patients.

• Differences in the prognosis of prostate cancer patients according to their genetic factors

• Use of SAS programming minimizes implementation barriers and increases availability for multisite research.


Description of Methods

• SCENT identifies non-negated clinical concepts within pathology report text.

• Built using SAS Base (does not require Text Miner add-on).

• Makes extensive use of SAS hash objects and regular expressions.

• Includes components for preprocessing, matching, negation and uncertainty detection, extracting diagnostic information (e.g., staging and Gleason score), and classifying report malignancy status.

• Flexibility to assign codes using a variety of coding systems.

• Validation used subset of SNOMED 3.x (~1000 concepts).
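
As an illustration of the diagnostic-information extraction mentioned above, the following is a minimal Python sketch of pulling a Gleason score out of report text with a regular expression. It is not SCENT's code (SCENT is written in SAS, and the slides do not show its Gleason expression); the pattern `GLEASON_RE` and the function `extract_gleason` are hypothetical, and the assumed notation ("3+4=7") may differ from what the preprocessed reports actually contain.

```python
import re

# Hypothetical pattern: the slides do not show SCENT's actual Gleason expression.
# Assumes the common "primary + secondary [= total]" notation.
GLEASON_RE = re.compile(
    r"gleason\s+(?:score\s+|grade\s+)?(\d)\s*\+\s*(\d)(?:\s*=\s*(\d{1,2}))?",
    re.IGNORECASE,
)

def extract_gleason(text):
    """Return (primary, secondary, total) for the first Gleason mention, or None."""
    match = GLEASON_RE.search(text)
    if match is None:
        return None
    primary, secondary = int(match.group(1)), int(match.group(2))
    # Fall back to the sum when the report does not state the total.
    total = int(match.group(3)) if match.group(3) else primary + secondary
    return primary, secondary, total
```

For example, `extract_gleason("PROSTATE ADENOCARCINOMA GLEASON SCORE 3+4=7")` yields the primary pattern, the secondary pattern, and the total as three integers.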


SCENT Process Diagram

[Figure: SCENT process diagram. Labels and examples from the figure are listed below.]

Inputs:

• Pathology Text (Research Database). Text: raw text segment from report. Line: sequential text segment identifier.

• Clinical Concepts (Excel). Type: morphology, topography, or procedural. Code: SNOMED 3.X. Class: malignant, basaloid, benign, or N/A. Description: concept description.

• Concept Dictionary (SAS)

• Regular Expressions

Step labels: Clean, Enhance, Tokenize Words, Loop Concepts, Examine Segments, Match Tokens, Check Negation, Code Matches, Extract Data (Disease Extent, Diagnostic Certainty, Tumor Staging, Gleason Score).

Concept example (Code: M-85033; Description: intraductal papillary adenocarcinoma with invasion):

• Tokenized description: [intraductal][papillary][adenocarcinoma][with][invasion]

• Token regular expressions: [((intra)?duct(al)?)][papillar(y|ies)][adenocarcinoma[ls]?]

Report text example (Preprocessed Text):

• Text segments (two versions shown in the figure):

[moderately-differentiated ductal adenocarcinoma with papillary][features.][the tumor involves 0.6 cm of one core.]

[moderately-differentiated ductal adenocarcinoma with papillary features.][the tumor involves 0.6 cm of one core.]

• Tokenized words: [moderately] [differentiated] [ductal] [adenocarcinoma] with [papillary] [features]

• Coded text: moderately differentiated <nlp snm=m85033 type=m class=3>ductal adenocarcinoma with papillary</nlp snm=m85033> features

Negation patterns:

• free (of|from)

• not? (support[a-z]*|identified)

• non(?!small|hodgkins)
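
The token matching, negation check, and coding steps named in the diagram can be sketched roughly as follows. This is a Python illustration, not SCENT's SAS implementation. The token and negation patterns are copied from the diagram; the matching rule (every concept token must fully match some word in the segment), the segment-wide negation check, and the names `CONCEPT`, `NEGATION`, and `code_segment` are assumptions made for the example.

```python
import re

# Token and negation patterns are copied from the process diagram.
# The data layout and the matching rule below are assumptions.
CONCEPT = {
    "code": "m85033", "type": "m", "class": "3",
    "tokens": [r"(intra)?duct(al)?", r"papillar(y|ies)", r"adenocarcinoma[ls]?"],
}
NEGATION = [
    r"free (of|from)",
    r"not? (support[a-z]*|identified)",
    r"non(?!small|hodgkins)",
]

def code_segment(segment, concept=CONCEPT):
    """Return (coded_text, negated); coded_text is None when the concept is absent."""
    text = segment.lower()
    words = text.split()
    positions = []
    for pattern in concept["tokens"]:
        # Assumed rule: each concept token must fully match some word, in any order.
        hit = next((i for i, w in enumerate(words)
                    if re.fullmatch(pattern, w.strip(".,;:"))), None)
        if hit is None:
            return None, False
        positions.append(hit)
    # Assumed rule: a negation pattern anywhere in the segment negates the concept.
    negated = any(re.search(p, text) for p in NEGATION)
    start, end = min(positions), max(positions)
    open_tag = (f"<nlp snm={concept['code']} type={concept['type']} "
                f"class={concept['class']}>")
    close_tag = f"</nlp snm={concept['code']}>"
    tagged = open_tag + " ".join(words[start:end + 1]) + close_tag
    return " ".join(words[:start] + [tagged] + words[end + 1:]), negated
```

Applied to "moderately differentiated ductal adenocarcinoma with papillary features", the sketch wraps the span from the first to the last matched word, which gives the same coded text as the diagram's example. A caller would discard or separately flag segments where `negated` is true.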


Sample Report Coding

Preprocessed Text:

LEFT BREAST CORE BIOPSY TWO O CLOCK.<BR>

INVASIVE DUCTAL CARCINOMA NOTTINGHAM GRADE 2.<BR>

NO CALCIFICATION IS IDENTIFIED.<BR>

NO VASCULAR INVASION IS IDENTIFIED.<BR>

HORMONE RECEPTOR AND HER 2 NEU STATUS PENDING AN ADDENDUM WILL FOLLOW.

Coded Text:

<NLP SNM=T04030 TYPE=T>LEFT BREAST</NLP SNM=T04030> CORE <NLP SNM=P1140 TYPE=P>BIOPSY</NLP SNM=P1140> TWO O CLOCK.<BR>

INVASIVE <NLP SNM=M85003 TYPE=M CLASS=3>DUCTAL CARCINOMA</NLP SNM=M85003> NOTTINGHAM GRADE 2.<BR>

NO CALCIFICATION IS IDENTIFIED.<BR>

NO VASCULAR INVASION IS IDENTIFIED.<BR>

HORMONE RECEPTOR AND HER 2 NEU STATUS PENDING AN ADDENDUM WILL FOLLOW.
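
Downstream use of the coded text could look something like the sketch below, which pulls the code, type, class, and covered text back out of the inline tags. The tag grammar is inferred only from the sample report on this slide, so it is an assumption, as are the names `TAG_RE` and `extract_codes`. It is written in Python for illustration; SCENT itself is SAS-based.

```python
import re

# Assumed tag grammar, inferred from the sample only:
#   <NLP SNM=code TYPE=t [CLASS=c]>covered text</NLP SNM=code>
TAG_RE = re.compile(
    r"<NLP SNM=(?P<code>\w+) TYPE=(?P<type>\w)(?: CLASS=(?P<cls>\w+))?>"
    r"(?P<text>.*?)</NLP SNM=(?P=code)>",
    re.IGNORECASE,
)

def extract_codes(coded_text):
    """Return a list of (code, type, class_or_None, covered_text) tuples."""
    return [(m["code"], m["type"], m["cls"], m["text"])
            for m in TAG_RE.finditer(coded_text)]
```

On the first coded line of the sample this returns one topography entry for LEFT BREAST and one procedure entry for BIOPSY, both without a class. The second line returns the morphology code M85003 with class 3.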


Validation Study

• To validate SCENT, trained chart abstractors reviewed electronic pathology reports.

• Random samples of breast (n=400) and prostate (n=400) cancer patients.

• Patients diagnosed at KPSC between 2000 and 2007.

• Reports included from six months post-diagnosis through end of 2008.

• In total, 206 breast and 186 prostate cancer patients contributed 490 and 425 eligible reports, respectively.

• SCENT classifications were compared with those of abstractors.


Classification Concordance

Columns: abstractor classifications. Rows: SCENT classifications. Column totals are shown in parentheses.

                                  Benign          Cancer           Other      Suspicious
                                              Recurrence  Primary Cancer
SCENT Classifications           %      N        %      N        %      N        %      N    Kappa
Breast Cancer (Total)              (436)            (32)            (18)             (4)
  Benign                     99.8    435        -      -        -      -     25.0      1     0.96
  Cancer Recurrence             -      -    100.0     32        -      -        -      -
  Other Primary Cancer        0.2      1        -      -    100.0     18     50.0      2
  Suspicious                    -      -        -      -        -      -     25.0      1
Prostate Cancer (Total)            (356)            (29)            (36)             (4)
  Benign                     99.4    354        -      -      5.6      2        -      -     0.95
  Cancer Recurrence             -      -     96.6     28      2.8      1        -      -
  Other Primary Cancer        0.6      2      3.4      1     91.7     33        -      -
  Suspicious                    -      -        -      -        -      -    100.0      4

Note: incident contralateral breast malignancies were considered to be recurrences.
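
The kappa values can be checked from the counts in the concordance table. The Python sketch below assumes the slide reports unweighted Cohen's kappa over the four classes (the slide says only "Kappa"); under that assumption the counts give approximately 0.96 for breast and 0.95 for prostate. The function name `cohens_kappa` and the matrix layout (rows are SCENT, columns are abstractors) are choices made for this example.

```python
def cohens_kappa(table):
    """Unweighted Cohen's kappa for a square table of counts."""
    n = sum(sum(row) for row in table)
    observed = sum(table[i][i] for i in range(len(table))) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Agreement expected by chance from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (observed - expected) / (1 - expected)

# Rows: SCENT; columns: abstractors.
# Class order: benign, cancer recurrence, other primary cancer, suspicious.
BREAST = [
    [435, 0, 0, 1],
    [0, 32, 0, 0],
    [1, 0, 18, 2],
    [0, 0, 0, 1],
]
PROSTATE = [
    [354, 0, 2, 0],
    [0, 28, 1, 0],
    [2, 1, 33, 0],
    [0, 0, 0, 4],
]

print(round(cohens_kappa(BREAST), 2), round(cohens_kappa(PROSTATE), 2))
```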


SCENT Performance Metrics

                   Sensitivity*        Specificity*        PPV*                NPV*
Breast Cancer      1.00 (0.93-1.00)    0.99 (0.98-1.00)    0.94 (0.85-0.98)    1.00 (0.99-1.00)
Prostate Cancer    0.97 (0.89-0.99)    0.99 (0.98-1.00)    0.97 (0.89-0.99)    0.99 (0.98-1.00)

* Shown with Wilson's 95% confidence interval.
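
For readers who want to see how such figures are computed, here is a minimal Python sketch of the four metrics with Wilson score intervals. The slides do not state how the four-class concordance table was collapsed to a binary outcome. The counts below assume that cancer recurrence and other primary cancer are positive and that benign and suspicious are negative, a grouping that reproduces the reported values after rounding. The function names and the z value of 1.959964 are choices made for this example.

```python
from math import sqrt

def wilson(successes, n, z=1.959964):
    """Return (proportion, lower, upper) using the Wilson score interval."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, max(0.0, center - half), min(1.0, center + half)

def metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV, and NPV, each with a Wilson interval."""
    return {
        "sensitivity": wilson(tp, tp + fn),
        "specificity": wilson(tn, tn + fp),
        "ppv": wilson(tp, tp + fp),
        "npv": wilson(tn, tn + fn),
    }

# Counts derived from the concordance table under the assumed grouping:
# positive = cancer recurrence or other primary cancer; negative = benign or suspicious.
BREAST = metrics(tp=50, fp=3, fn=0, tn=437)
PROSTATE = metrics(tp=63, fp=2, fn=2, tn=358)

for name, result in (("Breast", BREAST), ("Prostate", PROSTATE)):
    for metric, (p, lo, hi) in result.items():
        print(f"{name} {metric}: {p:.2f} ({lo:.2f}-{hi:.2f})")
```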


Conclusions

• Favorable results suggest SCENT can identify and extract information about primary and recurrent malignancies from pathology reports.

• Rapid cancer case identification.

• Improved measurement accuracy of common study endpoint.

• SCENT has the potential to expedite chart reviews by narrowing the search and highlighting relevant concepts.

• Generalized utility for extracting standardized disease scores and other clinical information.

• SCENT is proof of concept for SAS-based NLP that can be easily shared between institutions to support research.


Limitations & Next Steps

• SCENT has a number of limitations, including:

• Unable to disambiguate and contextualize identified clinical concepts without part-of-speech (POS) tagging.

• More susceptible to changes in text structure and increased linguistic variability than statistical NLP approaches.

• General purpose NLP (e.g., cTAKES) likely to perform better outside of pathology.

• Next steps include:

• Release SCENT source code and requisite support files.

• Optimize current functionality and assess feasibility of adding methods (e.g., POS tagging, n-grams, statistical classifiers).

• Attempt to identify non-pathologically diagnosed malignancies using radiology reports and clinical progress notes.

• Quantify cost savings associated with SCENT-assisted chart reviews.


Questions?