ODIE Toolkit NCBO Council Talk December 18, 2007 Rebecca Crowley crowleyrs@upmc.edu

Page 1: 1 ODIE Toolkit NCBO Council Talk December 18, 2007 Rebecca Crowley crowleyrs@upmc.edu

1

ODIE Toolkit

NCBO Council Talk

December 18, 2007

Rebecca Crowley

crowleyrs@upmc.edu

Slide # 2/35

Outline

• Overview of the Project
  • Aims, People, Organization, Domain, Philosophy
• Specific Aims from a use case approach
  • Information Extraction
  • Ontology Enrichment
• First steps, synergies, year 1 work, and working together

Slide # 3/35

Project Overview

• Funded by National Cancer Center
• Develop tools for
  • Information extraction from clinical text using ontologies
  • Enrichment of ontologies using clinical text
• Project Period: 9/27/2007 – 7/31/2011
• Collaboration with National Center for Biomedical Ontology
  • Subcontract to Stanford (consultation on Bioportal)
  • Subcontract to Mayo (Terminologies, NLP)

Slide # 4/35

Specific Aims

Specific Aim 1: Develop and evaluate methods for information extraction (IE) tasks using existing OBO ontologies, including:
  1. Named Entity Recognition
  2. Co-reference Resolution
  3. Discourse Reasoning
  4. Attribute Value Extraction

Specific Aim 2: Develop and evaluate general methods for clinical-text mining to assist in ontology development, including:
  1. Preprocessing
  2. Concept Discovery and Clustering
  3. Suggest taxonomic positioning and relationships

Specific Aim 3: Develop reusable software for performing information extraction and ontology development, leveraging existing NCBO tools and compatible with NCBO architecture.

Specific Aim 4: Enhance National Cancer Institute Thesaurus Ontology using the ODIE toolkit.

Specific Aim 5: Test the ability of the resulting software and ontologies to address important translational research questions in hematologic cancers.

Year 1 development goals

Slide # 5/35

Dual Proposal Goals

Slide # 6/35

People @pitt

Wendy Chapman, co-I

Rebecca Crowley, PI

Preet Chaudhary, co-I

Kaihong Liu, Graduate Student

Kevin Mitchell, Architect

Girish Chavan, Interfaces

John Dowling, Annotation

Slide # 7/35

Organization

Annotations
  People: Rebecca Crowley, Wendy Chapman, Kaihong Liu, John Dowling
  Task: Develop manually annotated sets for training and testing

Algorithms
  People: Rebecca Crowley, Wendy Chapman, Kaihong Liu, Kevin Mitchell
  Task: Consider and test existing algorithms; design, implement and test new algorithms

Architecture
  People: Rebecca Crowley, Kevin Mitchell, Girish Chavan
  Task: Develop and implement architecture

Slide # 8/35

Domain

• Will attempt to develop general tools whenever possible
• Priorities for evaluation of components in:
  • Radiology and pathology reports
  • NCIT as well as other clinically relevant OBO ontologies
  • Cancer domains (including hematologic oncology)

Slide # 9/35

Philosophy

• Toolkit for developers of NLP applications and ontologies
• Support interaction and experimentation
• Package systems at the conclusion of working with ODIE
• Foster cycle of enrichment and extraction needed to advance development of NLP systems
• Ontology enrichment as opposed to de novo development
• Human-machine collaboration as opposed to fully automated learning

Slide # 10/35

Specific Aims

Specific Aim 1: Develop and evaluate methods for information extraction (IE) tasks using existing OBO ontologies, including:
  1. Named Entity Recognition
  2. Co-reference Resolution
  3. Discourse Reasoning
  4. Attribute Value Extraction

Specific Aim 2: Develop and evaluate general methods for clinical-text mining to assist in ontology development, including:
  1. Preprocessing
  2. Concept Discovery and Clustering
  3. Suggest taxonomic positioning and relationships

Specific Aim 3: Develop reusable software for performing information extraction and ontology development, leveraging existing NCBO tools and compatible with NCBO architecture.

Specific Aim 4: Enhance National Cancer Institute Thesaurus Ontology using the ODIE toolkit.

Specific Aim 5: Test the ability of the resulting software and ontologies to address important translational research questions in hematologic cancers.

Key ODIE Functionality

Slide # 11/35

Named Entity Recognition

User has
• clinical documents
• one or more ontologies
• (and/or) one or more lexical resources (synonyms, POS)
• (optionally) a reference standard of human annotations

User wants to
• determine degree of coverage of different ontologies with text
• determine degree of overlap in annotations generated between ontologies
• (optionally) test accuracy of NER with different ontologies to choose ‘best’ ontology to annotate text with
• tag existing document set with concepts from ontology (optionally using the synonyms from their synonym source if not in ontology)

System produces annotated clinical documents and descriptive statistics
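One way to picture this use case is a dictionary lookup that tags ontology terms in a document and then reports how much of the text was covered. The sketch below is illustrative only: the two-concept ontology, the synonym lists, and the function names `annotate` and `coverage` are assumptions made for the example, not part of ODIE.

```python
import re

# Hypothetical toy ontology: concept name -> synonyms (not taken from NCIT).
ONTOLOGY = {
    "Prostatic Adenocarcinoma": ["prostatic adenocarcinoma", "adenocarcinoma of the prostate"],
    "Prostate Neoplasm": ["prostate neoplasm", "prostate tumor"],
}

def annotate(text, ontology):
    """Return (start, end, concept) for every synonym found, preferring longer terms."""
    terms = sorted(
        ((syn, concept) for concept, syns in ontology.items() for syn in syns),
        key=lambda pair: len(pair[0]),
        reverse=True,
    )
    annotations, covered = [], set()
    for syn, concept in terms:
        pattern = r"\b" + re.escape(syn) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            span = set(range(m.start(), m.end()))
            if span & covered:
                continue  # a longer term already claimed these characters
            covered |= span
            annotations.append((m.start(), m.end(), concept))
    return sorted(annotations)

def coverage(text, annotations):
    """Fraction of word tokens that fall inside at least one annotation."""
    tokens = list(re.finditer(r"\w+", text))
    hit = sum(any(s <= t.start() and t.end() <= e for s, e, _ in annotations) for t in tokens)
    return hit / len(tokens) if tokens else 0.0

doc = "Biopsy shows prostatic adenocarcinoma. No other prostate tumor seen."
anns = annotate(doc, ONTOLOGY)
print(anns, round(coverage(doc, anns), 2))
```

Running `annotate` once per ontology over the same documents would give comparable coverage figures, which is the kind of descriptive statistic the slide mentions.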

Slide # 12/35

Named Entity Recognition

Prostatic adenocarcinoma

Invasive Prostate Carcinoma

Malignant Prostate Neoplasm

Malignant Prostate Neoplasm

Prostate Neoplasm

Reproductive System Neoplasm Neoplasm

Clinical Document

Ontology Lexical Resource

Metathesaurus (synonyms), SPECIALIST (POS information)

Slide # 13/35

Named Entity Recognition

View Annotated Concepts From A Single Ontology

Slide # 14/35

Named Entity Recognition

Compare Annotations from Multiple Ontologies
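Comparing annotations across ontologies, or against a reference standard, comes down to set operations over annotated spans. The following is a minimal sketch under that assumption; the span offsets are made up, and Jaccard overlap and exact-match precision/recall are common choices rather than measures the talk specifies.

```python
def overlap(a, b):
    """Jaccard overlap between two sets of annotated (start, end) spans."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def precision_recall(predicted, reference):
    """Exact-match precision and recall of predicted spans against a reference standard."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall

# Made-up character-offset spans produced with two different ontologies.
ontology_a_spans = {(13, 37), (48, 62)}
ontology_b_spans = {(13, 37), (70, 75)}
reference_spans = {(13, 37), (48, 62), (80, 90)}

print(overlap(ontology_a_spans, ontology_b_spans))
print(precision_recall(ontology_a_spans, reference_spans))
```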

Slide # 15/35

Co-reference Resolution

User has
• clinical documents with NER annotations
• one or more ontologies
• (optionally) a reference standard of co-reference annotations

User wants to
• visualize co-references detected using one or more ontologies
• (optionally) test accuracy of CR with different ontologies to choose ontology for annotations
• tag existing document set with co-references from ontology

System produces annotated clinical documents and descriptive statistics

Slide # 16/35

Co-reference Resolution

Prostatic adenocarcinoma

Invasive Prostate Carcinoma

Malignant Prostate Neoplasm

Malignant Prostate Neoplasm

Prostate Neoplasm

Reproductive System Neoplasm Neoplasm
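The concept chain above suggests one simple ontology-driven heuristic for co-reference: a later, more general mention may refer back to an earlier, more specific one if it is an ancestor of it. The sketch below assumes the concepts on the slide form a single is-a chain and uses made-up mentions; it is not presented as the method ODIE uses.

```python
# Hypothetical is-a links (child -> parent), assumed from the chain on the slide.
PARENT = {
    "Prostatic Adenocarcinoma": "Invasive Prostate Carcinoma",
    "Invasive Prostate Carcinoma": "Malignant Prostate Neoplasm",
    "Malignant Prostate Neoplasm": "Prostate Neoplasm",
    "Prostate Neoplasm": "Reproductive System Neoplasm",
    "Reproductive System Neoplasm": "Neoplasm",
}

def ancestors(concept, parent=PARENT):
    """Walk the is-a links upward and return every ancestor, nearest first."""
    chain = []
    while concept in parent:
        concept = parent[concept]
        chain.append(concept)
    return chain

def may_corefer(earlier, later):
    """True if the later concept equals or generalizes the earlier one."""
    return later == earlier or later in ancestors(earlier)

def link_mentions(mentions):
    """Link each mention to the nearest preceding mention it may corefer with."""
    links = []
    for i, (text_i, concept_i) in enumerate(mentions):
        for j in range(i - 1, -1, -1):
            if may_corefer(mentions[j][1], concept_i):
                links.append((text_i, mentions[j][0]))
                break
    return links

# Made-up mentions, each already mapped to a concept by NER.
mentions = [
    ("prostatic adenocarcinoma", "Prostatic Adenocarcinoma"),
    ("the carcinoma", "Invasive Prostate Carcinoma"),
    ("the neoplasm", "Neoplasm"),
]
print(link_mentions(mentions))
```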

Slide # 17/35

Discourse Reasoning

User has
• a set of clinical documents with NER and CR annotations
• a set of information models about those documents

User wants to
• determine which information model (or parts of them) should be used for which clinical document

Slide # 18/35

Discourse Reasoning

BRAIN, RIGHT PARIETAL, STEREOTACTIC BIOPSY: Mucinous Adenocarcinoma, consistent with previous history of colon primary

BRAIN

Site, Morphology

COLON

Location, Grade, Size, TNM Stage
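A rough sketch of the model-selection idea: compare the specimen site with any stated primary site and pick an information model accordingly. The grouping of attributes into a "metastasis" model and a "primary" model, the rule itself, and the function name `select_model` are all assumptions made for illustration.

```python
import re

# Hypothetical information models; attribute names follow the slide, but
# which attributes belong to which model is an assumption.
MODELS = {
    "metastasis": ["Site", "Morphology"],
    "primary": ["Location", "Grade", "Size", "TNM Stage"],
}

def select_model(report):
    """Choose a model from the specimen site and any stated primary site."""
    specimen_site = report.split(",", 1)[0].strip().lower()
    m = re.search(r"history of (\w+) primary", report, flags=re.IGNORECASE)
    primary_site = m.group(1).lower() if m else specimen_site
    kind = "primary" if primary_site == specimen_site else "metastasis"
    return kind, MODELS[kind]

report = (
    "BRAIN, RIGHT PARIETAL, STEREOTACTIC BIOPSY: Mucinous Adenocarcinoma, "
    "consistent with previous history of colon primary"
)
print(select_model(report))
```

A real system would reason over the NER and CR annotations rather than raw strings; the string matching here only keeps the example short.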

Slide # 19/35

Attribute Value Extraction

User has
• clinical documents with NER, CR, DR annotations
• information model of specific subset of documents

Wants to
• extract attributes and values from clinical text conforming to model
• analyze data using common tools
• possibly search for particular cases later

Slide # 20/35

Attribute Value Extraction

• Histologic Type
• Clark’s Level
• Breslow Depth
• Mitoses
• Ulcer
• Perineural Invasion
• Angiolymphatic Invasion
• Regression

Slide # 21/35

Attribute Value Extraction

• Histologic Type – Superficial Spreading
• Clark’s Level – IV
• Breslow Depth – 1.75 mm
• Mitoses – Greater than 2 per HLP
• Ulcer – None
• Perineural Invasion – None
• Angiolymphatic Invasion – None
• Regression – None
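Filling a model like this from free text can be sketched with one pattern per attribute. The attribute names below follow the slide; the report wording, the regular expressions, and the `extract` function are assumptions for illustration, and a production extractor would build on the NER, CR, and DR annotations instead.

```python
import re

# One illustrative pattern per attribute (assumed, not from the talk).
PATTERNS = {
    "Histologic Type": r"melanoma,\s*([\w ]+?) type",
    "Clark's Level": r"Clark's level\s+(IV|V|I{1,3})\b",
    "Breslow Depth": r"Breslow depth\s+([\d.]+\s*mm)",
    "Ulcer": r"ulceration:\s*(\w+)",
    "Regression": r"regression:\s*(\w+)",
}

def extract(text, patterns=PATTERNS):
    """Return {attribute: value}, with None for attributes not found in the text."""
    values = {}
    for attribute, pattern in patterns.items():
        m = re.search(pattern, text, flags=re.IGNORECASE)
        values[attribute] = m.group(1) if m else None
    return values

# Made-up report text.
report = (
    "Malignant melanoma, superficial spreading type. Clark's level IV. "
    "Breslow depth 1.75 mm. Ulceration: none."
)
for attribute, value in extract(report).items():
    print(attribute, "-", value)
```

Returning `None` for a missing attribute keeps the output aligned with the model, so the resulting rows can be loaded into common analysis tools.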

Slide # 22/35

Ontology Enrichment

User has
• clinical documents
• ontology

User wants to
• identify potential candidate concepts from the documents to include in the ontology
  • visualized in a manner to ease search and recognition of presence or absence of those concepts in the ontology
  • with suggestions for where in taxonomy the concept should be placed
  • with suggestions for relationships

Slide # 23/35

Ontology Enrichment

Breast, Left, Excisional Biopsy: Mucinous Carcinoma

Breast, Right, Lumpectomy: Infiltrating Ductal Carcinoma

Breast, Left: Invasive Ductal Carcinoma

Breast, Left, Excisional Biopsy: Malignant Phylloides Tumor

Tumor shows osseous and lipomatous metaplasia

Ductal Breast Carcinoma

Breast Carcinoma

Malignant Breast Neoplasm

Breast Neoplasm

Breast Disorder

Disease or Disorder

Invasive Ductal Carcinoma

Slide # 24/35

Concept Discovery

Breast, Left, Excisional Biopsy: Mucinous Carcinoma

Breast, Right, Lumpectomy: Infiltrating Ductal Carcinoma

Breast, Left: Invasive Ductal Carcinoma

Breast, Left, Excisional Biopsy: Malignant Phylloides Tumor

Tumor shows osseous and lipomatous metaplasia

Ductal Breast Carcinoma

Breast Carcinoma

Malignant Breast Neoplasm

Breast Neoplasm

Breast Disorder

Disease or Disorder

Invasive Ductal Carcinoma
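In its simplest form, concept discovery is a set difference: diagnosis phrases from the documents that match neither an ontology concept nor a known synonym become candidates. The sketch below uses the phrases and concept names from the slide; the synonym table and the `discover` function are assumptions for illustration.

```python
# Diagnosis phrases from the example reports on the slide.
diagnoses = [
    "Mucinous Carcinoma",
    "Infiltrating Ductal Carcinoma",
    "Invasive Ductal Carcinoma",
    "Malignant Phylloides Tumor",
]

# Concept names shown in the ontology fragment on the slide.
ontology = {
    "Ductal Breast Carcinoma",
    "Breast Carcinoma",
    "Malignant Breast Neoplasm",
    "Breast Neoplasm",
    "Breast Disorder",
    "Disease or Disorder",
    "Invasive Ductal Carcinoma",
}

# Assumed synonym source mapping a surface form to an existing concept.
synonyms = {"infiltrating ductal carcinoma": "Invasive Ductal Carcinoma"}

def discover(phrases, ontology, synonyms):
    """Return phrases that match neither a concept name nor a known synonym."""
    known = {name.lower() for name in ontology}
    candidates = []
    for phrase in phrases:
        key = phrase.lower()
        if key in known or key in synonyms:
            continue
        if phrase not in candidates:
            candidates.append(phrase)
    return candidates

print(discover(diagnoses, ontology, synonyms))
```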

Slide # 25/35

Taxonomic Positioning

Breast, Left, Excisional Biopsy: Mucinous Carcinoma

Breast, Right, Lumpectomy: Infiltrating Ductal Carcinoma

Breast, Left: Invasive Ductal Carcinoma

Breast, Left, Excisional Biopsy: Malignant Phylloides Tumor

Tumor shows osseous and lipomatous metaplasia

Ductal Breast Carcinoma

Breast Carcinoma

Malignant Breast Neoplasm

Breast Neoplasm

Breast Disorder

Disease or Disorder

Invasive Ductal Carcinoma

Mucinous Carcinoma

Malignant Phylloides Tumor
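One lightweight way to suggest a position for a new concept is lexical similarity: rank existing concepts by token overlap with the candidate phrase plus its specimen site. The sketch below uses Jaccard overlap and a one-entry normalization table (tumor treated as neoplasm); both are assumptions for illustration, and the suggested parents are heuristic output, not the placements shown in the talk.

```python
# Concept names shown in the ontology fragment on the slide.
ONTOLOGY = [
    "Ductal Breast Carcinoma",
    "Breast Carcinoma",
    "Malignant Breast Neoplasm",
    "Breast Neoplasm",
    "Breast Disorder",
    "Disease or Disorder",
    "Invasive Ductal Carcinoma",
]

NORMALIZE = {"tumor": "neoplasm"}  # assumed lexical equivalence

def tokens(name):
    """Lowercased word set with simple synonym normalization."""
    return {NORMALIZE.get(word, word) for word in name.lower().split()}

def suggest_parent(candidate, site, ontology=ONTOLOGY):
    """Return the existing concept with the highest Jaccard overlap with site + candidate."""
    cand = tokens(site + " " + candidate)

    def score(concept):
        concept_tokens = tokens(concept)
        return len(cand & concept_tokens) / len(cand | concept_tokens)

    # Sorting first makes tie-breaking deterministic.
    return max(sorted(ontology), key=score)

print(suggest_parent("Mucinous Carcinoma", "Breast"))
print(suggest_parent("Malignant Phylloides Tumor", "Breast"))
```

In a human-machine workflow the ranked suggestions would be shown to a curator, who accepts or corrects the placement.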

Slide # 26/35

Relationships

Breast, Left, Excisional Biopsy: Mucinous Carcinoma

Breast, Right, Lumpectomy: Infiltrating Ductal Carcinoma

Breast, Left: Invasive Ductal Carcinoma

Breast, Left, Excisional Biopsy: Malignant Phylloides Tumor

Tumor shows osseous and lipomatous metaplasia

Ductal Breast Carcinoma

Breast Carcinoma

Malignant Breast Neoplasm

Breast Neoplasm

Breast Disorder

Disease or Disorder

Invasive Ductal Carcinoma

Mucinous Carcinoma

Malignant Phylloides Tumor

has-Finding

Metaplasia

Osseous metaplasia

Lipomatous metaplasia

Cartilaginous metaplasia

Morphologic Finding

Slide # 27/35

First Steps

• Use cases
• Survey of Bioportal, LexBio, GATE and UIMA
• Survey of ontology enrichment techniques
• Architectural assumptions and notional architecture
• Started discussions with Stanford and Mayo
• Delineated first year work
• Annotation software and document sets

Slide # 28/35

Architecture Decisions

• The primary goal of ODIE is to serve as a workbench for building and refining text processing pipelines and ontologies.
• Information retrieval is not a primary goal. However, ODIE may have a rudimentary search feature for annotated document collections.
• ODIE Toolkit will be a desktop application.
• ODIE UI will be based on the Eclipse Rich Client Platform.
• ODIE will use UIMA as the Language Engineering Platform.
  • GATE processing resources will be usable in ODIE by wrapping them in UIMA TAEs.
  • UIMA is highly configurable using XML descriptor files.
  • Better documentation, community support.
  • We will use GATE in first year for rapid prototyping and manual annotation.
• ODIE will have the ability to easily import and use UIMA TAEs developed by others. This may be expanded to GATE processing resources.
• ODIE will allow for packaging a pipeline for deployment in a production environment.
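The wrapping idea can be illustrated with a generic adapter pattern: every stage exposes one common interface, and a foreign component is wrapped so the pipeline can call it the same way. This sketch is plain Python, not the UIMA or GATE API; all class and function names here are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Document:
    text: str
    annotations: List[dict] = field(default_factory=list)

class Annotator:
    """Common interface every pipeline stage implements (hypothetical)."""
    def process(self, doc: Document) -> None:
        raise NotImplementedError

class KeywordAnnotator(Annotator):
    """Native stage: tags the first occurrence of a keyword."""
    def __init__(self, keyword: str, label: str):
        self.keyword, self.label = keyword, label

    def process(self, doc: Document) -> None:
        start = doc.text.lower().find(self.keyword.lower())
        if start >= 0:
            end = start + len(self.keyword)
            doc.annotations.append({"label": self.label, "start": start, "end": end})

class FunctionAdapter(Annotator):
    """Wraps a foreign component (here a plain function) behind the common interface."""
    def __init__(self, func: Callable[[str], List[dict]]):
        self.func = func

    def process(self, doc: Document) -> None:
        doc.annotations.extend(self.func(doc.text))

class Pipeline:
    """Runs each stage in order over a fresh Document."""
    def __init__(self, stages: List[Annotator]):
        self.stages = stages

    def run(self, text: str) -> Document:
        doc = Document(text)
        for stage in self.stages:
            stage.process(doc)
        return doc

def sentence_counter(text: str) -> List[dict]:
    # Stand-in for an externally developed component.
    return [{"label": "SentenceCount", "value": text.count(".")}]

pipeline = Pipeline([KeywordAnnotator("carcinoma", "Diagnosis"), FunctionAdapter(sentence_counter)])
result = pipeline.run("Invasive ductal carcinoma. Margins negative.")
print(result.annotations)
```

Because the pipeline only depends on the shared interface, the same stage list could be serialized and reused outside the workbench, which is the packaging goal described above.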

Slide # 29/35

Notional Architecture

Slide # 30/35

Synergies: Ontrez

Ontrez

ODIE

• Information Retrieval
• Range of inputs

• Other kinds of annotation
• Information Extraction
• Ontology Enrichment
• Clinical Documents

• Annotation
• Named Entity Recognition

• Enhance annotation of Ontrez?
• Use inference and indexing on clinical documents?

Slide # 31/35

Synergies: Mayo

• NER and Co-reference resolution
• Clustering, discovery of synonyms
• LexGrid
• Using similar tools, focused on larger range of document types
• More – to be explored

Slide # 32/35

First Year Work

• NER and co-reference modules
• Concept discovery
• Develop manually annotated reference standards for NER and CR
• Focus on testing and developing algorithms
• ODIE 1.0 will include basic architecture and modules for NER, CR and concept discovery, statistics

Slide # 33/35

Working Together

• Work with Mayo to scope first year collaboration (NER, CR, synonym discovery)
• Decisions regarding terminology access
• Better define what NCBO resources we will use

Slide # 34/35

Working Together

• SourceForge site, ODIE website and Wiki
• All our meetings are open and we are happy to arrange teleconferences
  • Mondays 2-4 pm (EST)
• Schedule visits with Mayo and Stanford for early spring ’08
• Anticipate providing monthly progress updates at the ODIE website starting in January ’08
• Other ideas? What’s the expectation of the Council?

35

Questions?

Comments?