De-identification of unstructured clinical documents July 13, 2017 NAACCR Annual Conference Albuquerque, NM


De-identification of unstructured clinical documents

July 13, 2017

NAACCR Annual Conference

Albuquerque, NM


Outline

• Background

• SEER evaluation of de-identification tools

• Next steps

• Conclusion


Use of clinical narrative text

• Majority of information is in free text format
• Registries collect and store an increasing amount of clinical documents
• Registries generate narrative text
• Few data elements are abstracted
• This extremely rich data source can be used for research
• One major obstacle is that the documents contain personally identifiable information (PII)


Specific applications

• SEER Virtual Tissue Repository Initiative
• SEER Natural Language Processing projects
• Uses at individual registries


Specific applications: NLP

Providing de-identified reports can accelerate the field of NLP for cancer surveillance.

SEER use of 2500 de-id reports:

• Linguamatics I2E
• IBM Watson
• HLA
• ASCO CancerLinQ
• DeepPhe
• Single academic partners
• NCI-DOE pilot


SEER Evaluation of De-identification tools

Two studies


De-identification evaluation protocol

• 5 SEER registries
• IRB approvals
• Pathology report selection
  • 4000 randomly selected from reports received in 2011
  • 800/registry
  • Stratified by cancer site, 160 each: breast, lung, CRC, prostate and other
• IMS provided technical instructions
• Each registry performed the de-identification
• Reviewed and compared de-id tool output to original report
• Recorded number of occurrences where PII was missed, by PII category
• Automated count of de-id phrases by PII category
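The report selection step above (800 reports per registry, 160 per cancer site) can be illustrated with a short sketch of stratified random sampling. This is not the procedure the registries actually ran: the `(report_id, site)` input layout, the function name `stratified_sample`, and the fixed seed are all assumptions made for illustration.

```python
import random
from collections import defaultdict

SITES = ("breast", "lung", "crc", "prostate", "other")
PER_SITE = 160  # 5 sites x 160 = 800 reports per registry


def stratified_sample(reports, per_site=PER_SITE, seed=0):
    """Draw per_site reports at random from each cancer-site stratum.

    reports: iterable of (report_id, site) pairs (hypothetical layout).
    """
    by_site = defaultdict(list)
    for report_id, site in reports:
        # Anything outside the four named sites falls into "other".
        by_site[site if site in SITES else "other"].append(report_id)

    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    sample = {}
    for site in SITES:
        pool = by_site[site]
        if len(pool) < per_site:
            raise ValueError(f"only {len(pool)} reports for site {site!r}")
        sample[site] = rng.sample(pool, per_site)  # without replacement
    return sample


# Made-up pool of 5000 report IDs, 1000 per site.
pool = [(f"R{i}", SITES[i % 5]) for i in range(5000)]
picked = stratified_sample(pool)
total = sum(len(ids) for ids in picked.values())
```

Sampling without replacement within each stratum keeps every site equally represented regardless of how common it is in the registry's 2011 intake.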


Performance measurement

• De-identification rate
  • PII phrase level: N de-identified phrases / all PII phrases
  • PII at patient level: 1 - (N patients w/ missed PII / N patients), with N patients = 4000 overall (800 per registry)
  • Calculated per each PII category, overall, and per registry
• Limitations
  • N de-id phrases counted based on PII tag (includes over-scrubbing)
  • De-id rates for names of patients and providers cannot be calculated separately
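The two rates can be written out as a small sketch. The phrase-level formula is taken directly from the slide. The patient-level rates in the results tables appear to be computed as one minus the share of patients with missed PII, so that reading is used here as an assumption. The function names are illustrative, and the example figures are the "Names" row of the five-registry De-ID™ results reported later in this deck.

```python
def phrase_rate(deid_phrases, missed_phrases):
    """Phrase-level rate: de-identified phrases / all PII phrases."""
    total = deid_phrases + missed_phrases
    if total == 0:
        raise ValueError("no PII phrases in this category")
    return deid_phrases / total


def patient_rate(patients_with_missed_pii, n_patients):
    """Patient-level rate: share of patients with no missed PII.

    Assumed reading: 1 - (patients with missed PII / all patients).
    """
    if n_patients <= 0:
        raise ValueError("n_patients must be positive")
    return 1 - patients_with_missed_pii / n_patients


# "Names" row, five registries: 13030 de-identified phrases, 88 missed,
# 19 of 4000 patients with at least one missed name.
names_phrase = phrase_rate(13030, 88)
names_patient = patient_rate(19, 4000)
print(f"{names_phrase:.3f} {names_patient:.3f}")
```

Because the de-identified count is taken from the tool's PII tags, over-scrubbed phrases inflate the numerator, which is the first limitation listed above.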


De-ID™

http://www.de-idata.com/


Performance of De-ID™ in five SEER registries

| PHI type | De-Id phrases N | Missed phrases N | All PHI phrases | PII phrase DeID rate | N pts w/ missed PII | Pt level DeID rate |
|---|---|---|---|---|---|---|
| Names | 13030 | 88 | 13118 | 0.993 | 19 | 0.995 |
| Dates | 8717 | 31 | 8748 | 0.996 | 23 | 0.994 |
| Phone Numbers | 909 | 0 | 909 | 1.000 | 0 | 1.000 |
| Places | 1532 | 0 | 1532 | 1.000 | 0 | 1.000 |
| Street Addresses | 350 | 10 | 360 | 0.972 | 7 | 0.998 |
| Zip Codes | 844 | 0 | 844 | 1.000 | 0 | 1.000 |
| ID Numbers | 1358 | 77 | 1435 | 0.946 | 51 | 0.987 |
| Total PHI | 26740 | 206 | 26946 | 0.992 | 100 | 0.975 |
| Path Numbers | 1678 | 1310 | 2988 | 0.562 | 810 | 0.798 |
| Institutions | 1355 | 1673 | 3028 | 0.447 | 825 | 0.794 |
| Total de-id info | 29773 | 3189 | 32962 | 0.903 | 1735 | 0.566 |


NLM Scrubber

Beta version tested

https://scrubber.nlm.nih.gov/


Performance of NLM Scrubber in four SEER registries

| NLM scrubber tags | N phrases de-id | N phrases missed | Total N phrases | N patients not de-id | De-id rate phrases | De-id patients |
|---|---|---|---|---|---|---|
| Personal name (pt name + provider name) | 5130 | 0+8 | 5138 | 0 | 0.998 | 1.000 |
| Address | 466 | 1 | 467 | 1 | 0.998 | 0.999 |
| Alphanumeric (ssn + mrn + phone + path#) | 1420 | 0+0+0+179 | 1599 | 77 | 0.888 | 0.901 |
| Date | 1393 | 1 | 1394 | 1 | 0.999 | 0.999 |
| Total | 8409 | 189 | 8598 | 79 | 0.978 | 0.899 |


Other tools

• PARAT, Privacy Analytics
• MIST, MITRE


Summary

• Reasonable performance for PII (with the exception of Seattle registry)
• Suboptimal for institution and pathology specimen IDs
• Inconsistency across reports and registries
• Registries' opinion: generally not satisfied


Next steps

• PII annotation on representative sample of ePath reports
• Customization and testing of high-potential de-identification tools
  • Latest version of NLM scrubber
  • BoB


PII Annotation Protocol for Narrative Clinical Text

• Annotation of PII: all PII is clearly marked and categorized in the text
• CDAP pipeline will be used for annotation
• Each registry will annotate a sample of reports
• PII annotated reports will be used for:
  • Customization and training of de-identification tools
  • Validation/testing of the tools prior to deployment
  • Validation/testing each time major revisions/versions of the tools are introduced


Annotation Process


Proposed metrics/goals

• Patient name: > 99%
• Other names (relatives, providers, etc.): > 98%
• SSN: 100%
• Dates: > 98%
• Other identification numbers (MRN, account #, insurance plan #): > 99%
• Patient address (street, city, zip code): > 98%
• Patient phone, fax, email, URL: > 99%
• Specimen/slide/path report #: > 97%
• Institution/lab name: > 97%
• Institution address: > 97%
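A minimal sketch of how measured rates might be checked against these proposed goals follows. The category keys, the `meets_goal` helper, and the mapping of the De-ID™ result categories "Dates", "Path Numbers" and "Institutions" onto the goal categories are assumptions for illustration. The observed values are the five-registry phrase-level rates from the De-ID™ results in this deck.

```python
# (threshold, strict): strict=True means the rate must exceed the
# threshold ("> 99%"); strict=False means it must reach it ("100%").
GOALS = {
    "patient_name": (0.99, True),
    "other_names": (0.98, True),
    "ssn": (1.00, False),
    "dates": (0.98, True),
    "other_id_numbers": (0.99, True),
    "patient_address": (0.98, True),
    "patient_contact": (0.99, True),  # phone, fax, email, URL
    "specimen_number": (0.97, True),
    "institution_name": (0.97, True),
    "institution_address": (0.97, True),
}


def meets_goal(category, rate):
    """Return True if a de-identification rate satisfies its goal."""
    threshold, strict = GOALS[category]
    return rate > threshold if strict else rate >= threshold


# Phrase-level rates from the five-registry De-ID(TM) table
# (category mapping is an assumption).
observed = {
    "dates": 0.996,
    "specimen_number": 0.562,
    "institution_name": 0.447,
}
unmet = [cat for cat, rate in observed.items() if not meets_goal(cat, rate)]
print(unmet)
```

Keeping the strict flag next to each threshold preserves the difference between the "greater than" goals and the SSN goal, which has to be met exactly.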


Conclusion

• Existing de-identification tools generally have good performance

• De-identification rates are as good as or better than human de-identification

• Performance decreases with increased variability of reports (multiple institutions)

• Need for customization and testing prior to deployment

• Creation of an annotated sample of reports representative of the document corpora is highly suggested

• Governance: Controlled access to the de-identified reports (e.g. DUA) is recommended


Resources

NISTIR 8053: De-Identification of Personal Information (Oct. 2015)
http://nvlpubs.nist.gov/nistpubs/ir/2015/NIST.IR.8053.pdf

NIST Special Publication 800-188: De-Identifying Government Datasets (second draft, Dec. 2016)
http://csrc.nist.gov/publications/drafts/800-188/sp800_188_draft2.pdf


Acknowledgments:

SEER registries: CT, HI, KY, NM, and Seattle

NCI: Spencer Morris, Paul Fearn, Steve Friedman

IMS team: Rusty Shields, Dave Annett, Laurie Buck, Linda Coyle

NIH/NLM: Mehmet Kayaalp

USC: Stephane Meystre


Thank you

Questions

Contact Valentina Petkov: [email protected]


www.cancer.gov www.cancer.gov/espanol


Performance of De-ID™ in SEER Seattle registry

| PHI type | De-Id phrases N | Missed phrases N | All PHI phrases | PII phrase DeID rate | N pts w/ missed PII | Pt level DeID rate |
|---|---|---|---|---|---|---|
| Names | 2972 | 39 | 3011 | 0.987 | 15 | 0.981 |
| Dates | 1520 | 8 | 1528 | 0.995 | 7 | 0.991 |
| Phone Numbers | 113 | 0 | 113 | 1.000 | 0 | 1.000 |
| Places | 255 | 0 | 255 | 1.000 | 0 | 1.000 |
| Street Addresses | 65 | 0 | 65 | 1.000 | 0 | 1.000 |
| Zip Codes | 105 | 0 | 105 | 1.000 | 0 | 1.000 |
| ID Numbers | 263 | 47 | 310 | 0.848 | 24 | 0.970 |
| Total PHI | 5293 | 94 | 5387 | 0.983 | 46 | 0.943 |
| Path Numbers | 571 | 221 | 792 | 0.721 | 140 | 0.825 |
| Institutions | 284 | 809 | 1093 | 0.260 | 350 | 0.563 |
| Total de-id info | 6148 | 1124 | 7272 | 0.845 | 536 | 0.330 |


Performance of De-ID™ in SEER Hawaii registry

| PHI type | De-Id phrases N | Missed phrases N | All PHI phrases | PII phrase DeID rate | N pts w/ missed PII | Pt level DeID rate |
|---|---|---|---|---|---|---|
| Names | 2972 | 45 | 3017 | 0.985 | 2 | 0.998 |
| Dates | 1520 | 0 | 1520 | 1.000 | 0 | 1.000 |
| Phone Numbers | 113 | 0 | 113 | 1.000 | 0 | 1.000 |
| Places | 255 | 0 | 255 | 1.000 | 0 | 1.000 |
| Street Addresses | 65 | 0 | 65 | 1.000 | 0 | 1.000 |
| Zip Codes | 105 | 0 | 105 | 1.000 | 0 | 1.000 |
| ID Numbers | 236 | 0 | 236 | 1.000 | 0 | 1.000 |
| Total PHI | 5266 | 45 | 5311 | 0.992 | 2 | 0.998 |
| Path Numbers | 571 | 36 | 607 | 0.941 | 26 | 0.968 |
| Institutions | 284 | 45 | 329 | 0.863 | 45 | 0.944 |
| Total de-id info | 6121 | 126 | 6247 | 0.980 | 73 | 0.906 |


Performance of De-ID™ in SEER Kentucky registry

| PHI type | De-Id phrases N | Missed phrases N | All PHI phrases | PII phrase DeID rate | N pts w/ missed PII | Pt level DeID rate |
|---|---|---|---|---|---|---|
| Names | 3647 | 2 | 3649 | 0.999 | 1 | 0.999 |
| Dates | 2974 | 10 | 2984 | 0.997 | 5 | 0.994 |
| Phone Numbers | 661 | 0 | 661 | 1.000 | 0 | 1.000 |
| Places | 801 | 0 | 801 | 1.000 | 0 | 1.000 |
| Street Addresses | 167 | 10 | 177 | 0.944 | 7 | 0.991 |
| Zip Codes | 559 | 0 | 559 | 1.000 | 0 | 1.000 |
| ID Numbers | 604 | 7 | 611 | 0.989 | 4 | 0.995 |
| Total PHI | 9413 | 29 | 9442 | 0.997 | 17 | 0.979 |
| Path Numbers | 385 | 57 | 442 | 0.871 | 44 | 0.945 |
| Institutions | 533 | 521 | 1054 | 0.506 | 186 | 0.768 |
| Total de-id info | 10331 | 607 | 10938 | 0.945 | 247 | 0.691 |


Performance of De-ID™ in SEER Connecticut registry

| PHI type | De-Id phrases N | Missed phrases N | All PHI phrases | PII phrase DeID rate | N pts w/ missed PII | Pt level DeID rate |
|---|---|---|---|---|---|---|
| Names | 451 | 1 | 452 | 0.998 | 0 | 1.000 |
| Dates | 1022 | 13 | 1035 | 0.987 | 11 | 0.986 |
| Phone Numbers | 1 | 0 | 1 | 1.000 | 0 | 1.000 |
| Places | 40 | 0 | 40 | 1.000 | 0 | 1.000 |
| Street Addresses | 13 | 0 | 13 | 1.000 | 0 | 1.000 |
| Zip Codes | 15 | 0 | 15 | 1.000 | 0 | 1.000 |
| ID Numbers | 22 | 0 | 22 | 1.000 | 0 | 1.000 |
| Total PHI | 1564 | 14 | 1578 | 0.991 | 11 | 0.986 |
| Path Numbers | 17 | 472 | 489 | 0.035 | 182 | 0.767 |
| Institutions | 87 | 254 | 341 | 0.255 | 200 | 0.744 |
| Total de-id info | 1668 | 740 | 2408 | 0.693 | 393 | 0.482 |


Performance of De-ID™ in SEER New Mexico registry

| PHI type | De-Id phrases N | Missed phrases N | All PHI phrases | PII phrase DeID rate | N pts w/ missed PII | Pt level DeID rate |
|---|---|---|---|---|---|---|
| Names | 2988 | 1 | 2989 | 1.000 | 1 | 0.999 |
| Dates | 1681 | 0 | 1681 | 1.000 | 0 | 1.000 |
| Phone Numbers | 21 | 0 | 21 | 1.000 | 0 | 1.000 |
| Places | 181 | 0 | 181 | 1.000 | 0 | 1.000 |
| Street Addresses | 40 | 0 | 40 | 1.000 | 0 | 1.000 |
| Zip Codes | 60 | 0 | 60 | 1.000 | 0 | 1.000 |
| ID Numbers | 233 | 23 | 256 | 0.910 | 23 | 0.971 |
| Total PHI | 5204 | 24 | 5228 | 0.995 | 24 | 0.970 |
| Path Numbers | 134 | 524 | 658 | 0.204 | 418 | 0.478 |
| Institutions | 167 | 44 | 211 | 0.791 | 44 | 0.945 |
| Total de-id info | 5505 | 592 | 6097 | 0.903 | 486 | 0.393 |