De-identifying Pathology Reports for Pathology Informatics

James Gardner, Li Xiong

Department of Math and Computer Science

Fusheng Wang, Andrew Post, Joel Saltz

Center for Comprehensive Informatics

Introduction

• The HIPAA Privacy Rule regulates the use and disclosure of Protected Health Information (PHI)

• De-identification of pathology reports is critical to enabling secondary use of medical records for research

• HIDE (Health Information DE-identification) is an open-source de-identification tool built on statistical de-identification techniques

HIPAA Identifiers

1. Names;

2. All geographical subdivisions smaller than a state;

3. All elements of dates (except year);

4. Phone numbers;

5. Fax numbers;

6. Electronic mail addresses;

7. Social Security numbers;

8. Medical record numbers;

9. Health plan beneficiary numbers;

10. Account numbers;

11. Certificate/license numbers;

12. Vehicle identifiers and serial numbers;

13. Device identifiers and serial numbers;

14. Web Universal Resource Locators (URLs);

15. Internet Protocol (IP) address numbers;

16. Biometric identifiers, including finger and voice prints;

17. Full face photographic images or comparable images; and

18. Any other unique identifying number, characteristic, or code

• These identifiers have to be removed, or

• Based on the opinion of a qualified statistical expert, the risk of identifying an individual is very small

HIDE Overview

• Utilizes the state-of-the-art named entity recognition technique, Conditional Random Fields, for extracting PHI

− Previous tools such as DE-ID and HMS scrubber use rule-based approaches, which are labor-intensive and not portable

• Provides flexible de-identification options including full de-identification and state-of-the-art statistical de-identification

− Previous tools allow only simple removal or substitution of PHI

• Provides an easy-to-use web-based interface that utilizes the latest web technologies

• Integrated with caTIES; caTissue integration in progress

PHI Extraction

• Utilizes a state-of-the-art NLP technique, Conditional Random Fields

− High accuracy, easy to train, portable

• Combines different feature sets and sampling techniques

− Feature sets: dictionary, affix, regular expression and context

• Can use default models or custom trained models

− Web interface for annotating and training custom models

− A set of reports is loaded and manually labeled

− The labeled documents are used to train a model for automatically de-identifying new reports
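A minimal sketch of how the four feature families (dictionary, affix, regular expression, context) might be computed per token before being passed to a CRF trainer. This is illustrative only and not HIDE's actual feature code: the dictionary contents, the regular expressions, the feature names and the `token_features` helper are all assumptions.

```python
import re

# Hypothetical mini-dictionary; a real system would load much larger name lists.
NAME_DICTIONARY = {"john", "mary", "smith"}

# Hypothetical regular-expression features, keyed by feature name.
REGEX_FEATURES = {
    "looks_like_date": re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$"),
    "looks_like_phone": re.compile(r"^\d{3}-\d{3}-\d{4}$"),
    "all_digits": re.compile(r"^\d+$"),
}


def token_features(tokens, i, affix_len=3, window=1):
    """Build a feature dict for tokens[i] from the four feature families."""
    tok = tokens[i]
    feats = {
        # Dictionary feature: is the lowercased token a known name?
        "dict:name": tok.lower() in NAME_DICTIONARY,
        # Affix features: leading and trailing characters of the token.
        "affix:prefix": tok[:affix_len].lower(),
        "affix:suffix": tok[-affix_len:].lower(),
    }
    # Regular-expression features: one boolean per pattern.
    for name, pattern in REGEX_FEATURES.items():
        feats["regex:" + name] = bool(pattern.match(tok))
    # Context features: neighboring tokens within the window, padded at the edges.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        feats["ctx:%+d" % offset] = tokens[j].lower() if 0 <= j < len(tokens) else "<pad>"
    return feats


tokens = ["Patient", "John", "seen", "on", "03/14/2009"]
features = [token_features(tokens, i) for i in range(len(tokens))]
```

Each token's feature dict, paired with a manually assigned PHI label, would form one training example in the labeled sequence for a report.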

HIDE: De-identification Options

• Full de-identification

− Safe harbor: all 18 HIPAA identifiers removed or substituted

• Partial de-identification

− Limited data set: all direct HIPAA identifiers removed or substituted (except dates and address elements other than street/P.O. Box)

• Configurable de-identification

− A configurable set of identifiers removed or substituted

• Statistical de-identification

− Advanced anonymization that guarantees rigorous, statistically acceptable privacy while keeping the utility of the data
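The full, partial and configurable options can be pictured as the same substitution step run with different sets of identifier types. The sketch below assumes PHI has already been extracted as labeled character spans. The label names, the `span_of` and `deidentify` helpers and the sample report text are hypothetical and do not reflect HIDE's API.

```python
def span_of(text, value, label):
    """Hypothetical helper: locate value in text and return (start, end, label)."""
    start = text.index(value)
    return (start, start + len(value), label)


def deidentify(text, spans, remove_labels):
    """Replace every span whose label is in remove_labels with a [LABEL] placeholder.

    spans must be non-overlapping (start, end, label) character offsets.
    """
    out = []
    cursor = 0
    for start, end, label in sorted(spans):
        out.append(text[cursor:start])
        # Substitute selected identifier types; keep the others verbatim.
        out.append("[" + label + "]" if label in remove_labels else text[start:end])
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


# Made-up report text and identifier values, for illustration only.
report = "John Smith, MRN 123456, biopsy on 03/14/2009."
spans = [
    span_of(report, "John Smith", "NAME"),
    span_of(report, "123456", "MRN"),
    span_of(report, "03/14/2009", "DATE"),
]

FULL = {"NAME", "MRN", "DATE"}   # safe-harbor style: substitute everything
LIMITED = {"NAME", "MRN"}        # limited-data-set style: dates are kept

full = deidentify(report, spans, FULL)
limited = deidentify(report, spans, LIMITED)
```

A configurable run would simply pass any other subset of labels as `remove_labels`.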

Statistical De-identification Example

De-identification satisfying k-anonymity (k=2): every record is indistinguishable within a group of at least k records
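A small sketch of what the k-anonymity condition checks, assuming structured records with quasi-identifier attributes. The records, the attribute names and the age-banding generalization are made up for illustration; this is not HIDE's anonymization algorithm.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


def generalize_age(record, width=10):
    """Replace an exact age with a band such as '40-49' (one possible generalization)."""
    low = (record["age"] // width) * width
    out = dict(record)
    out["age"] = "%d-%d" % (low, low + width - 1)
    return out


# Made-up records for illustration only.
records = [
    {"age": 42, "sex": "F", "diagnosis": "carcinoma"},
    {"age": 47, "sex": "F", "diagnosis": "adenoma"},
    {"age": 61, "sex": "M", "diagnosis": "melanoma"},
    {"age": 68, "sex": "M", "diagnosis": "carcinoma"},
]

raw_ok = is_k_anonymous(records, ["age", "sex"], k=2)
generalized = [generalize_age(r) for r in records]
generalized_ok = is_k_anonymous(generalized, ["age", "sex"], k=2)
```

With exact ages every record is unique, so the check fails for k=2. After banding ages into decades, each (age, sex) group holds two records and the check passes.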

Study 1: PHI Extraction on Emory Pathology Reports

(100 reports, 10-fold cross validation)

Precision: true positives over the sum of true positives and false positives

Recall (sensitivity): true positives over total actual positives

F1: harmonic mean of precision and recall, 2 * precision * recall / (precision + recall)
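The three metrics can be computed directly from sets of predicted and actual PHI spans. The sketch below assumes exact-match scoring on (start, end, label) tuples, which is one common convention and not necessarily the one used in these studies. The example spans are made up.

```python
def precision_recall_f1(predicted, actual):
    """Compute precision, recall and F1 over sets of (start, end, label) spans."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up spans: four actual PHI items, two of which the extractor found.
actual = {(0, 10, "NAME"), (16, 22, "MRN"), (34, 44, "DATE"), (50, 62, "PHONE")}
predicted = {(0, 10, "NAME"), (34, 44, "DATE")}

p, r, f1 = precision_recall_f1(predicted, actual)
```

Here both predictions are correct (precision 1.0) but half of the actual PHI is missed (recall 0.5), giving an F1 of 2/3.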

Study 2: PHI Extraction on i2b2 Reports

• Based on 669 discharge summaries, 10-fold cross validation

• Good precision and recall for most individual PHI identifiers

• Good overall precision and recall for PHI extraction
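For reference, 10-fold cross validation splits the documents into ten folds and holds each fold out once for testing while training on the other nine. The sketch below is a plain-Python illustration with a hypothetical `k_fold_splits` helper. It assigns documents to folds by index without shuffling, whereas a real evaluation would typically shuffle first.

```python
def k_fold_splits(n_items, k=10):
    """Yield (train_indices, test_indices) pairs, one per held-out fold."""
    # Assign item i to fold i % k so fold sizes differ by at most one.
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [idx for f, fold in enumerate(folds) if f != held_out for idx in fold]
        yield train, test


# 669 documents, matching the size of the discharge summary set above.
splits = list(k_fold_splits(669, k=10))
```

Per-fold precision and recall would then be averaged across the ten held-out folds.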

Study 3: Impact of Different Feature Sets

Dictionary (d), affix (a), regular expression (r) and context (c) features are listed in order of increasing importance for statistical, CRF-based PHI extraction

Integrating HIDE with caTIES

• caTIES (cancer Text Information Extraction System) provides tools for de-identification and automated coding of free-text pathology reports

• caTIES supports pluggable de-identifiers through its CaTIES_DeIdentifier interface

• HIDE implements this interface as HIDEDeIdentifier, which calls the HIDE client API

• Added a HIDE de-id option to the caTIES installer

• HIDE has been bundled with caTIES since release v3.7 (May 2010)

Integrating HIDE with caTissue (in Progress)

• caTissue uses caTIES V2.x, refactored into caTissue’s workflow

• HIDE integration with caTissue is similar to the caTIES integration

• Implementation and evaluation are ongoing

• Goal: Integration of pathology reports into caTissue installation at Winship Cancer Institute at Emory University

Ongoing Development

• Continue development on HIDE/caTissue integration

• Usability improvement: simplified installation process

• System improvements

− Efficiency and scalability of the system

− Support for multiple file formats

− Additional statistical de-identification options

HIDE Demo

http://www.mathcs.emory.edu/hide/demos

Thank you

http://www.mathcs.emory.edu/hide

Li Xiong
