De-identifying Pathology Reports for Pathology Informatics James Gardner, Li Xiong Department of Math and Computer Science Fusheng Wang, Andrew Post, Joel Saltz Center for Comprehensive Informatics

De-identifying Pathology Reports for Pathology Informatics

James Gardner, Li Xiong

Department of Math and Computer Science

Fusheng Wang, Andrew Post, Joel Saltz

Center for Comprehensive Informatics

Introduction

• The HIPAA Privacy Rule regulates the use and disclosure of Protected Health Information (PHI)

• De-identification of pathology reports is critical to enabling secondary use of medical records for research

• HIDE (Health Information DE-identification) is an open-source de-identification tool built on advanced statistical de-identification techniques

HIPAA Identifiers

1. Names;

2. All geographical subdivisions smaller than a state;

3. All elements of dates (except year);

4. Phone numbers;

5. Fax numbers;

6. Electronic mail addresses;

7. Social Security numbers;

8. Medical record numbers;

9. Health plan beneficiary numbers;

10. Account numbers;

11. Certificate/license numbers;

12. Vehicle identifiers and serial numbers;

13. Device identifiers and serial numbers;

14. Web Universal Resource Locators (URLs);

15. Internet Protocol (IP) address numbers;

16. Biometric identifiers, including finger and voice prints;

17. Full face photographic images or comparable images; and

18. Any other unique identifying number, characteristic, or code

• These identifiers have to be removed, or

• Based on the opinion of a qualified statistical expert, the risk of identifying an individual is very small

HIDE Overview

• Utilizes the state-of-the-art named entity recognition technique, Conditional Random Fields, for extracting PHI

− Previous tools such as DE-ID and HMS scrubber use rule-based approaches which are labor intensive and not portable

• Provides flexible de-identification options including full de-identification and state-of-the-art statistical de-identification

− Previous tools allow simple removal or substitution of the PHI

• Provides an easy-to-use web-based interface that utilizes the latest web-technologies

• Integrated with caTIES; integration with caTissue is in progress

PHI Extraction

• Utilizes state-of-the-art NLP technique, Conditional Random Fields

− High accuracy, easy to train, portable

• Combines different feature sets and sampling techniques

− Feature sets: dictionary, affix, regular expression and context

• Can use default models or custom trained models

− Web interface for annotating and training custom models

− A set of reports are loaded and manually labeled

− The labeled documents will generate a trained model for automatically de-identifying new reports
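The four feature families listed above can be illustrated with a small per-token feature function of the kind usually fed to a CRF tagger. This is only a sketch under stated assumptions: the dictionary contents, the regular expressions, the feature names and the function `token_features` are hypothetical and are not taken from HIDE itself.

```python
import re

# Hypothetical name dictionary; a real system would load much larger lists.
NAME_DICTIONARY = {"james", "mary", "smith"}

# Hypothetical regular-expression features for common PHI shapes.
REGEX_FEATURES = {
    "looks_like_date": re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$"),
    "looks_like_phone": re.compile(r"^\d{3}-\d{3}-\d{4}$"),
    "all_digits": re.compile(r"^\d+$"),
}


def token_features(tokens, i, window=1, affix_len=3):
    """Build a feature dict for tokens[i] from the four feature families."""
    token = tokens[i]
    lower = token.lower()
    features = {
        # Dictionary feature: is the token in a known-name list?
        "in_name_dictionary": lower in NAME_DICTIONARY,
        # Affix features: leading and trailing characters.
        "prefix": lower[:affix_len],
        "suffix": lower[-affix_len:],
    }
    # Regular-expression features: does the token match a PHI-like shape?
    for name, pattern in REGEX_FEATURES.items():
        features[name] = bool(pattern.match(token))
    # Context features: neighbouring tokens within the window.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        neighbour = tokens[j].lower() if 0 <= j < len(tokens) else "<pad>"
        features[f"word[{offset:+d}]"] = neighbour
    return features


tokens = "Patient Mary Smith seen on 03/14/2009".split()
features = token_features(tokens, 1)  # features for the token "Mary"
```

A sequence of such dicts, one per token, together with the manual labels from the annotation interface, is the usual training input for a CRF library; which library and which exact features HIDE uses is not stated on the slide.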

HIDE: De-identification Options

• Full de-identification

− Safe harbor: all 18 HIPAA identifiers removed or substituted

• Partial de-identification

− Limited data set: all direct HIPAA identifiers removed or substituted (dates and address elements other than street/P.O. Box are retained)

• Configurable de-identification

− A configurable set of identifiers removed or substituted

• Statistical de-identification

− Advanced anonymization that guarantees rigorous, statistically acceptable privacy while keeping the utility of the data
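The difference between the full, partial and configurable options can be sketched as one substitution routine driven by a set of identifier types. This is a minimal illustration, assuming PHI has already been extracted as character spans; the function `deidentify`, the type labels and the placeholder format are hypothetical, not HIDE's API.

```python
def deidentify(text, spans, remove_types, placeholder="[{type}]"):
    """Replace each span whose PHI type is in remove_types with a placeholder.

    spans: iterable of non-overlapping (start, end, phi_type) character offsets.
    """
    out = []
    cursor = 0
    for start, end, phi_type in sorted(spans):
        out.append(text[cursor:start])  # untouched text before the span
        if phi_type in remove_types:
            out.append(placeholder.format(type=phi_type))
        else:
            out.append(text[start:end])  # identifier type is kept as-is
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


# Made-up example report and spans, for illustration only.
report = "Mary Smith, MRN 123456, seen 03/14/2009."
spans = [(0, 10, "NAME"), (16, 22, "MRN"), (29, 39, "DATE")]

# Full de-identification: every extracted type is substituted.
full = deidentify(report, spans, {"NAME", "MRN", "DATE"})
# Partial (limited data set) style: dates are kept.
partial = deidentify(report, spans, {"NAME", "MRN"})
```

Here `full` becomes "[NAME], MRN [MRN], seen [DATE]." and `partial` keeps the date. A configurable option is then just a user-chosen `remove_types` set.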

Statistical De-identification Example

De-identification satisfying k-anonymity (k=2): every record is indistinguishable from the other records in its group, and every group contains at least k records
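The k-anonymity condition can be checked mechanically by grouping records on their quasi-identifier values and verifying that every group has at least k members. The sketch below assumes a made-up table with `age` and `zip` as quasi-identifiers and uses a simple age-range generalization; the records, field names and helper functions are hypothetical and are not the example shown on the original slide.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs >= k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


def generalize_age(age, width=10):
    """Replace an exact age with a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


# Made-up records for illustration only.
records = [
    {"age": 34, "zip": "30322", "diagnosis": "carcinoma"},
    {"age": 37, "zip": "30322", "diagnosis": "adenoma"},
    {"age": 52, "zip": "30329", "diagnosis": "melanoma"},
    {"age": 58, "zip": "30329", "diagnosis": "carcinoma"},
]

# Generalize the age column; all other fields are left unchanged.
generalized = [dict(r, age=generalize_age(r["age"])) for r in records]

before = is_k_anonymous(records, ["age", "zip"], k=2)
after = is_k_anonymous(generalized, ["age", "zip"], k=2)
```

With exact ages every record is unique, so `before` is False; after generalization each (age range, zip) pair covers two records, so `after` is True, while the diagnosis column keeps its research value.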

Study 1: PHI Extraction on Emory Pathology Reports

(100 reports, 10-fold cross validation)

Precision: true positives over the sum of true positives and false positives

Recall (sensitivity): true positives over total actual positives

F1: combination of precision and recall: 2*precision*recall/(precision + recall)
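The three metrics defined above can be computed directly from counts of true positives, false positives and false negatives. The counts in the example are invented purely to show the arithmetic; they are not results from the study.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts.

    tp: PHI tokens correctly extracted
    fp: non-PHI tokens wrongly extracted
    fn: PHI tokens that were missed
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented counts: 90 correct extractions, 10 spurious, 30 missed.
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
```

For these counts precision is 0.9, recall is 0.75 and F1 is about 0.818. For de-identification, recall is usually the more sensitive number, since a missed identifier leaks PHI while a spurious extraction only removes some non-PHI text.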

Study 2: PHI Extraction on i2b2 Reports

• Based on 669 discharge summaries, 10-fold cross validation

• Good precision and recall for most individual PHI identifiers

• Good overall precision and recall for PHI extraction
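For readers unfamiliar with 10-fold cross validation, the sketch below shows how 669 documents could be partitioned so that each document is used for testing exactly once. It assumes the split is done at the document level and omits shuffling for clarity; the function `k_fold_splits` is hypothetical and the slides do not say how the folds were actually built.

```python
def k_fold_splits(n_items, k=10):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    indices = list(range(n_items))
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_items // k + (1 if f < n_items % k else 0) for f in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# 669 documents, as in the study; indices stand in for the documents.
splits = list(k_fold_splits(669, k=10))
test_sizes = [len(test) for _, test in splits]
```

A model is trained on each `train` set and scored on the matching `test` set, and the per-fold precision and recall are then averaged. In practice the document order would normally be shuffled before splitting.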

Study 3: Impact of Different Feature Sets

Dictionary (d), affix (a), regular expression (r) and context (c) features are listed in order of increasing importance for statistical CRF-based PHI extraction

Integrating HIDE with caTIES

• caTIES (cancer Text Information Extraction System) provides tools for de-identification and automated coding of free-text pathology reports

• caTIES supports pluggable de-identifiers through its CaTIES_DeIdentifier interface

• HIDE implements this interface as HIDEDeIdentifier, which calls the HIDE client API

• Added HIDE de-id option in caTIES installer

• HIDE is bundled with caTIES since release v3.7 (May 2010)

Integrating HIDE with caTissue (in Progress)

• caTissue uses caTIES v2.x, refactored into caTissue’s workflow

• HIDE integration with caTissue is similar to the caTIES integration

• Implementation and evaluation are underway

• Goal: Integration of pathology reports into caTissue installation at Winship Cancer Institute at Emory University

Ongoing Development

• Continue development on HIDE/caTissue integration

• Usability improvement: simplified installation process

• System improvements

− Efficiency and scalability of the system

− Support for multiple file formats

− Additional statistical de-identification options

HIDE Demo

http://www.mathcs.emory.edu/hide/demos

Thank you

http://www.mathcs.emory.edu/hide

Li Xiong ([email protected])