Why Add Semantics to your Data? Or: Eat your veggies; even a little bit is good for you For TBPT F2F Nashville, TN Sherri de Coronado Dec 7, 2011 1


Why Add Semantics to your Data?
Or: Eat your veggies; even a little bit is good for you

For TBPT F2F Nashville, TN

Sherri de Coronado

Dec 7, 2011


Eat Your Veggies


Semantics Topics

• Why add semantics to your data? Overview
• Continuum of adding semantics
• Use Cases
• Drivers for adding semantics
• ‘New’ techniques for knowledge discovery with and without prior annotation
• Automated annotation and enrichment analysis
• NLP
• Linked Data
• Discussion of TBPT needs


Why add semantics?

• Adding semantics takes substantial effort; is it worth the effort?
• In your own closed environment, it may not be worth the effort.
• Consensus and use of standards allow data to be collected systematically, correctly and re-usably.
• Unambiguous representation of meaning of data
• Human and computer readability
• Provenance and Data Governance!
• Difficult to add later
• Increases ability to report large volumes of data to appropriate recipients like IRB, sponsors, monitors
• Secondary use of data – Anticipate reuse
• Computer readable semantics improve ability to link to related information; lowers cost of secondary use, discovery, aggregation


Continuum of Adding Semantics

Post the DD, Schema, etc.

Standard Schema/ etc.

Standard terminology

Standard Terminology & Metadata


Examples Along the Continuum

• Publish the schema or Forms

• E.g. REDCap* forms – easy to use, easy to share, no semantics

• E.g. CAP Pathology Report forms (with schema/ values)

• Mapping or Conversion – Independent Data Models, mapping or equivalent required, can’t join tables

• E.g. C3D CDMS sharing data with BTRIS (Biomedical Translational Research Information System)

• caDSR Metadata/ data from C3D converted to BTRIS warehouse schema and terminology

• E.g. Trial registration from one DB sharing data with CTRP

• Standard Terminology

• E.g. CDISC subsets – Shared terms and meaning for reporting

• Standard Terminology and Metadata

• E.g. SDTM, ISA Tab based (MageTab, NanoTab)

• E.g. Clinical forms in caDSR with DE, DEC, properties – DE components + permissible values versioned and tied to terminology
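
The far end of the continuum, where data elements carry versioned components and permissible values tied to terminology, can be pictured with a small data structure. This is only a sketch under stated assumptions: the class names, the identifier `DE-0001` and the `CONCEPT:...` codes are hypothetical and are not caDSR's actual model or API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class PermissibleValue:
    value: str          # the literal stored in the data, e.g. "F"
    concept_code: str   # terminology concept that gives the value its meaning


@dataclass
class DataElement:
    public_id: str      # hypothetical identifier, not a real caDSR ID
    version: str        # data elements are versioned
    object_class: str   # DEC part: the kind of thing being described
    prop: str           # DEC part: the property being recorded
    permissible_values: List[PermissibleValue] = field(default_factory=list)

    def meaning_of(self, raw_value: str) -> Optional[str]:
        """Return the concept code tied to a stored value, or None if not permitted."""
        for pv in self.permissible_values:
            if pv.value == raw_value:
                return pv.concept_code
        return None


# Hypothetical example element; codes are placeholders, not real terminology codes.
gender = DataElement(
    public_id="DE-0001", version="1.0",
    object_class="Person", prop="Gender",
    permissible_values=[
        PermissibleValue("F", "CONCEPT:FEMALE"),
        PermissibleValue("M", "CONCEPT:MALE"),
    ],
)
print(gender.meaning_of("F"))
```

Because every stored value resolves to a concept code, two forms that encode the same answer differently can still be compared by meaning.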


What can you do with Semantics?

[Figure labels: caGrid, etc.; Cancer Data; Linked Cancer Data Cloud]


Use Case: Secondary Use of Clinical Data - SHARP


Data Normalization through Structured Metadata and Terminology: eMERGE (project in SHARP area 4)

• EMR and Genomics – Goals: work towards standardizing phenotype information to help elucidate genotype/ phenotype relationships, leading way toward data for large scale population studies.

• eMERGE – data normalization through structured metadata and terminology
• Multicenter study in multiple domains
• Mapped phenotype data dictionaries from 5 network sites (using caDSR, NCIt, SDTM, SNOMED)
• Built Elemap, early version of tool to help with mapping DE and value sets, a common need.
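
Mapping site-specific data dictionaries onto shared elements and value sets can be sketched roughly as below. This is an illustration only and not how Elemap works: the site names, field names and `CONCEPT:...` codes are all hypothetical.

```python
# Hypothetical mapping tables: (site, local field) -> common element,
# and per-site local values -> shared concept codes.
COMMON_ELEMENT = {
    ("site_a", "smoker"): "smoking_status",
    ("site_b", "tobacco_use"): "smoking_status",
}
VALUE_MAPPINGS = {
    "site_a": {"smoker": {"Y": "CONCEPT:CURRENT_SMOKER", "N": "CONCEPT:NON_SMOKER"}},
    "site_b": {"tobacco_use": {"1": "CONCEPT:CURRENT_SMOKER", "0": "CONCEPT:NON_SMOKER"}},
}


def normalize(site, record):
    """Translate one site's record into common elements and concept codes.

    Returns (normalized, unmapped); unmapped lists the (field, value) pairs
    that still need a curator's attention.
    """
    normalized, unmapped = {}, []
    for local_field, value in record.items():
        common = COMMON_ELEMENT.get((site, local_field))
        concept = VALUE_MAPPINGS.get(site, {}).get(local_field, {}).get(value)
        if common is None or concept is None:
            unmapped.append((local_field, value))
        else:
            normalized[common] = concept
    return normalized, unmapped


print(normalize("site_a", {"smoker": "Y"}))
print(normalize("site_b", {"tobacco_use": "1", "bmi": "27"}))
```

Returning the unmapped pairs separately reflects that mapping is usually incremental: whatever cannot be translated yet is surfaced rather than silently dropped.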


Second Approach - NLP

• cTAKES 1.2 -- Clinical Text Analysis and Knowledge Extraction System (cTAKES). October 20, 2011

• “The Mayo SHARP (SHARPn) Natural Language Processing (NLP) team is excited to announce an updated release of the Clinical Text Analysis and Knowledge Extraction System (cTAKES), cTAKESv1.2. cTAKES is a free and open source NLP system distributed by Mayo Clinic through Open Health Natural Language Processing (OHNLP) consortium which allows researchers to utilize clinical information stored in free text through NLP techniques.”

• Includes a new annotator (beta version), SideEffect, which extracts physician-asserted drug side effects from clinical notes.

• http://sourceforge.net/projects/ohnlp/files/icTAKES/


Use Case: Nanomaterials

• Interpretation of Multidisciplinary Data from Multiple Data Sources
• Nanomaterial “space” is too large to synthesize and characterize all possible materials
• Would like to be able to predict chemical, physical, biological properties of new materials
• Improved nanomaterial safety and efficacy

• Need:
• Annotation with controlled vocabulary

• Query across sources to find relevant data

• Search with controlled vocabulary terms

• Use semantic relationships to enrich data set retrieval

• Identify data sets that comply with a defined set of minimal characteristic measurements (Data QC)
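
Two of the needs listed above, using semantic relationships to enrich retrieval and checking data sets against a minimal set of characterization measurements, can be sketched as follows. Everything here is assumed for illustration: the tiny term hierarchy, the data sets and the required-measurement list are hypothetical and not taken from any nanomaterial standard.

```python
# Hypothetical controlled vocabulary: each term maps to its narrower terms.
NARROWER = {
    "toxicity": ["cytotoxicity", "genotoxicity"],
    "cytotoxicity": [],
    "genotoxicity": [],
}

# Hypothetical annotated data sets.
DATASETS = [
    {"id": "ds1", "annotations": ["cytotoxicity"],
     "measurements": ["size distribution", "shape", "zeta potential"]},
    {"id": "ds2", "annotations": ["genotoxicity"], "measurements": ["shape"]},
    {"id": "ds3", "annotations": ["porosity"], "measurements": ["size distribution"]},
]

# Hypothetical minimal characterization set used for the QC check.
REQUIRED = ["size distribution", "shape", "zeta potential"]


def expand(term):
    """Return the term plus all of its narrower terms, transitively."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(NARROWER.get(current, []))
    return seen


def search(datasets, term):
    """Find data sets annotated with the term or any narrower term."""
    wanted = expand(term)
    return [d["id"] for d in datasets if wanted & set(d["annotations"])]


def missing_measurements(dataset, required):
    """List the required measurements a data set does not report."""
    return sorted(set(required) - set(dataset["measurements"]))


print(search(DATASETS, "toxicity"))
print(missing_measurements(DATASETS[1], REQUIRED))
```

A search for the broad term finds data sets annotated only with narrower terms, which is the retrieval gain that a plain keyword match would miss.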


Data Sources and End Users

Physicist, Clinician, Chemist, Biologist, Engineer, Regulator

• Government: EPA, NIOSH, NIST, NIH, FDA, DOE, USPTO
• Industry
• Academia: Stanford, Harvard, MIT/MGH, Emory/GT, OSU, etc.
• SDO: ASTM, ISO, HL7, CDISC, IEEE


Types of Nanomaterial Data

Physicist, Clinician, Chemist, Biologist, Engineer, Regulator

• Identification and Composition: Naming, Composition, Surface Chemistry, Synthesis, Impurities, etc.
• Intrinsic Properties: Size Distribution, Shape, Surface Area, Porosity, Refractive Index, etc.
• Extrinsic Properties: Agglomeration, Aggregation, Stability/Degradation, Zeta Potential, Redox Potential, Catalytic Activity, etc.
• Toxicity: Cytotoxicity, Acute Toxicity, Chronic Toxicity, Genotoxicity, PK/ADME, Teratogenicity, etc.
• Environmental Fate: Transport Properties, Biotic Degradability, Abiotic Degradability, Bioaccumulation, Degradation Product Toxicity


Facilitating Nanomaterial-Based Drug Design

[Figure, adapted from ICR Nanotechnology WG. Labels: Manufacturing; Nanomaterials; Preclinical Studies; Clinical Studies; Virtual Manufacturing; Virtual Nanomaterials; In Silico Studies; Manufacturing Processes; Physical Properties; Chemical Properties; Molecular Assays; Cellular Models; Animal Models; Nano-SAR*]

* Quantitative Structure-Activity Relationships (SARs) to predict properties


Semantics: Possible Future Directions

• Automated annotations
• Lots of un-structured data out there. What can you do? (Examples)
• Nigam Shah (Stanford) – Making sense of unstructured data

• Linked Data
• Semantic Web approach


Making Sense of Un-Annotated Medical Data
(Nigam Shah – creator of NCBO Annotator)
http://www.bioontology.org/making-sense-of-unstructured-data-in-medicine

• Procedure:
• Process textual metadata to automatically tag EMR text with as many ontology terms as possible. (Very large #s of records)
• Assign Doc IDs to ontology terms
• Create enriched lexicon using fairly simple NLP/ counting/ semantic types
• Enrichment analysis a la GO, but using the automated annotations. Analyze tagged data for hypothesis generation
• Workflow was able to detect Vioxx/ MI issue with patient summary data
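
The first two steps of the procedure, tagging text with ontology terms and assigning document IDs to those terms, can be sketched with a simple phrase lookup. This is a toy stand-in and not the NCBO Annotator: the lexicon, the `TERM:...` identifiers and the sample notes are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical lexicon: surface phrase -> ontology term ID.
# Synonyms map to the same term (Vioxx is the brand name of rofecoxib).
LEXICON = {
    "myocardial infarction": "TERM:MI",
    "heart attack": "TERM:MI",
    "rofecoxib": "TERM:ROFECOXIB",
    "vioxx": "TERM:ROFECOXIB",
}


def annotate(text):
    """Return the set of term IDs whose phrases occur in the text."""
    lowered = text.lower()
    found = set()
    for phrase, term_id in LEXICON.items():
        # Whole-phrase match so that substrings of longer words do not count.
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            found.add(term_id)
    return found


def build_index(documents):
    """Map each term ID to the IDs of the documents it was found in."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term_id in annotate(text):
            index[term_id].add(doc_id)
    return index


# Hypothetical notes, for illustration only.
docs = {
    "d1": "Patient on Vioxx presented with a heart attack.",
    "d2": "History of myocardial infarction.",
    "d3": "No relevant history.",
}
index = build_index(docs)
print(sorted(index["TERM:MI"] & index["TERM:ROFECOXIB"]))
```

Once terms point to document IDs, co-occurrence questions reduce to set intersections, which is what makes the later counting and enrichment steps cheap on very large record collections.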


Workflow for Annotating Data

The method was used to show that the Vioxx-MI relationship could have been detected in patient data through enrichment analysis using ontologies.

[Figure: Venn diagram of patient counts]

• Vioxx patients: 1,560
• MI patients: 1,827
• Both Vioxx and MI: 339
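
The three counts on the slide are enough to compute the share of Vioxx patients who also had an MI, but judging whether that share is elevated needs the total number of patients, which the slide does not give. The sketch below therefore takes the total as a parameter. It uses a plain risk ratio as a simple stand-in for the enrichment analysis described in the talk; the function names are hypothetical.

```python
def contingency(n_exposed, n_event, n_both, n_total):
    """Build the 2x2 table (a, b, c, d) from marginal counts."""
    a = n_both                    # exposed, with event
    b = n_exposed - n_both        # exposed, without event
    c = n_event - n_both          # unexposed, with event
    d = n_total - n_exposed - c   # unexposed, without event
    if min(a, b, c, d) < 0:
        raise ValueError("counts are inconsistent with the stated total")
    return a, b, c, d


def risk_ratio(n_exposed, n_event, n_both, n_total):
    """Risk of the event among the exposed divided by risk among the unexposed."""
    a, b, c, d = contingency(n_exposed, n_event, n_both, n_total)
    if c == 0:
        raise ValueError("no events among the unexposed; ratio is undefined")
    return (a / (a + b)) / (c / (c + d))


# Counts taken from the slide.
VIOXX, MI, BOTH = 1560, 1827, 339

# Share of Vioxx patients with an MI; this needs no total.
share_mi_among_vioxx = BOTH / VIOXX
print(round(share_mi_among_vioxx, 4))

# The total patient count is NOT on the slide; supply the real value to use
# risk_ratio(VIOXX, MI, BOTH, n_total). A small synthetic check instead:
print(risk_ratio(10, 20, 5, 100))
```

A ratio well above 1 would suggest the event is over-represented among exposed patients; a proper analysis would also attach a significance test and correct for multiple comparisons.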


Linked Data

• “Linked Data refers to a set of best practices for publishing and connecting structured data on the Web”

“Linked Data - The Story So Far” 2009. Christian Bizer (Freie Universität Berlin, Germany), Tom Heath (Talis Information Ltd, UK) and Tim Berners-Lee (Massachusetts Institute of Technology, USA) DOI: 10.4018/jswis.2009081901, ISSN: 1552-6283, EISSN: 1552-6291

• Similar to web documents, but for publishing data on the web

• “Rules” for linked data (Berners-Lee 2006)

1. Use URIs as names for things.

2. Use HTTP URIs so that people can look up the names

3. When someone looks up a URI provide useful information using the standards (RDF)

4. Include links to other URIs so that they can discover more things

• Conversion of CDEs and terminology to an RDF triple store would make them queryable with SPARQL.

• Already substantial progress – e.g. Jim McCusker
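
The idea of storing CDEs and terminology as triples and querying them by pattern can be sketched in plain Python. This is a toy in-memory stand-in for a triple store, not a SPARQL engine; the `http://example.org/` URIs and property names are hypothetical.

```python
EX = "http://example.org/"  # hypothetical namespace, following rule 2 (HTTP URIs)

# Each triple is (subject, predicate, object). Objects that are URIs act as
# links to further things (rule 4); plain strings are literal values.
TRIPLES = {
    (EX + "cde/gender-a", EX + "prop/valueMeaningConcept", EX + "concept/female"),
    (EX + "cde/gender-b", EX + "prop/valueMeaningConcept", EX + "concept/female"),
    (EX + "cde/age", EX + "prop/valueMeaningConcept", EX + "concept/age"),
    (EX + "concept/female", EX + "prop/label", "Female"),
}


def match(pattern, triples=TRIPLES):
    """Yield triples matching an (s, p, o) pattern; None acts as a wildcard."""
    for triple in triples:
        if all(want is None or want == got for want, got in zip(pattern, triple)):
            yield triple


# Roughly what a SPARQL query of this shape would ask:
#   SELECT ?cde WHERE { ?cde ex:valueMeaningConcept ex:female }
cdes = sorted(
    s for s, _, _ in match((None, EX + "prop/valueMeaningConcept", EX + "concept/female"))
)
print(cdes)
```

Because names are URIs, the same pattern works across data published by different groups as long as they link to the same concept URI.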


These four “Gender” CDEs can be discovered to be aggregatable by leveraging their underlying concepts

[Figure: graph linking caDSR Metadata to EVS Metadata]

• Permissible values: “female”, “2”, “Female”, “Female”
• Value meaning concepts (http://11179.iso/valueMeaningConcept): C16576, C46110
• Data element IDs (http://11179.iso/dataElementID): 2179640, 2529081, 2200604, 2179641
• Object class concepts (http://11179.iso/objectClassConcept): Patient (C16960), Person (C25190), Participant (C29867)
• Concept hierarchy links: http://evs.gov/rdfs:subClassOF


Discussion: Your current/ future semantic needs?
Questions/ Issues/ Requests for SI

• Re: Consensus and use of standards allow data to be collected systematically, correctly and re-usably.
• Issue: Lots of standards & if you want to interact with systems/ data using other standards, there’s still a lot of work
• Still better off with documented metadata/ meanings/ terminology/ provenance, more amenable to reuse
• Question – what tools / techniques are you using, what do you need? E.g. mappings, mapping tools (terminology or metadata), easier access to terminology in LexEVS e.g. by REST, etc?

• “Diagnoses” in Tissue Bank System vs CBM – translation? (Static mapping? Live mapping? Metadata and terminology transformation to a common model? Just terminology?)


Discussion: Issues/ Questions cont’d

• How to find out quickly within caTissue if there is a DE that exists?
• Use caDSR API to search for DE by name using wildcard search
• Future? REST service or Triple store with SPARQL access?

• Answering the question “how many males and females are on all trials” should return the answer even though male and female were represented differently in trials -- could do this with a service, or with linked data

• Semantics via e.g. ISA Tab metadata and terminology vs central curation of metadata (e.g. in 11179 repository)? (pros and cons)

• Needs re: NLP. In use, what kind of additional support for caBIG?

• Needs re: automated annotation?
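
The “how many males and females are on all trials” question above can be answered once each trial's local values are tied to shared concepts. The following is a minimal sketch of that aggregation, assuming hypothetical trial names, local codings and concept labels; it is not a caDSR or LexEVS service.

```python
from collections import Counter

# Hypothetical value meanings: each trial encodes sex differently, but every
# local value points at the same shared concept label.
VALUE_MEANINGS = {
    "trial_a": {"female": "FEMALE", "male": "MALE"},
    "trial_b": {"2": "FEMALE", "1": "MALE"},
    "trial_c": {"Female": "FEMALE", "Male": "MALE"},
}


def count_by_concept(records):
    """Count participants per shared concept across trials.

    records is an iterable of (trial, raw_value) pairs; values without a
    known meaning are counted under "UNMAPPED" instead of being dropped.
    """
    counts = Counter()
    for trial, raw_value in records:
        concept = VALUE_MEANINGS.get(trial, {}).get(raw_value, "UNMAPPED")
        counts[concept] += 1
    return counts


# Hypothetical participant records.
records = [
    ("trial_a", "female"), ("trial_a", "male"),
    ("trial_b", "2"), ("trial_b", "2"),
    ("trial_c", "Male"), ("trial_c", "unknown"),
]
counts = count_by_concept(records)
print(counts["FEMALE"], counts["MALE"], counts["UNMAPPED"])
```

Keeping an explicit count of unmapped values shows how much of the data is still outside the shared terminology, which is useful when deciding where curation effort should go.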


CBIIT Semantic Operations and Infrastructure: Contacts

• Contacts
• Sherri de Coronado, Acting Director. [email protected]
• Larry Wright, Associate Director, EVS. [email protected]
• Margaret Haber, Associate Director, EVS. [email protected]
• Denise Warzel, Associate Director, Informatics Operations. [email protected]
• Dianne Reeves, Associate Director, Biomedical Data. [email protected]
• Gilberto Fragoso, Associate Director, EVS. [email protected]

• NCI/CBIIT Semantic Infrastructure/ Roadmap:
• Christo Andonyadis, Acting CTO. [email protected]

• Selected URLs
• NCIm Browser: http://ncim.nci.nih.gov
• NCI Term Browser: http://nciterms.nci.nih.gov
• NCIt Browser: http://ncit.nci.nih.gov
• CDE Browser: https://cdebrowser.nci.nih.gov/CDEBrowser/
• VKC for LexEVS info: https://cabig-kc.nci.nih.gov/Vocab/KC/index.php/LexBig_and_LexEVS