Driving the Terminology Hub RDF Triplets as a means to express lexical and referential data. Therese Vachon, NIBR, Unit Head UltraLink Technologies W3C Workshop on RDF Access to Relational Databases 25-26 October, 2007 — Boston, MA, USA


Driving the Terminology Hub

RDF Triplets as a means to express lexical and referential data.

Therese Vachon, NIBR, Unit Head UltraLink Technologies

W3C Workshop on RDF Access to Relational Databases

25-26 October, 2007 — Boston, MA, USA


Requirements

Cross-linking database information (e.g. on genes, proteins, metabolic pathways, compounds and ligands) to the original sources is a key issue.

The productivity of accessing, sharing, searching, navigating, cross-linking and analyzing internal and external data relevant to the pharmaceutical industry should be increased.


Strategy

In NIBR, we have been developing a semantic integration layer on top of knowledge resources that has been implemented within various services and applications.

It uses

• A rich domain-specific terminology (biology, chemistry and medicine) containing 1.6 million terms

• A Terminology Hub containing 8 GB of referential data (cross-references between data repositories)

Using that knowledge, the scientist can access all data at hand with just a single mouse-click.


Application Areas for Terminologies

Categorization of documents (via associated taxonomies)

Search for concepts

Semantic expansion of queries using synonyms and related terms

Identification and extraction of relevant concepts (e.g. targets, genes, diseases, products) from texts

Annotation of textual data with controlled terms as referential anchors

Construction of a semantic layer on top of information sources allowing context-sensitive navigation (Ultralink)
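
The semantic expansion of queries listed above can be illustrated with a minimal Python sketch. The miniature thesaurus and the function names `build_index` and `expand_query` are hypothetical and only serve the illustration; they are not part of the NIBR terminology or of any UltraLink interface.

```python
# Hypothetical miniature thesaurus: normalized term -> list of synonyms.
# The entries are illustrative only and are not taken from the NIBR terminology.
THESAURUS = {
    "chronic obstructive pulmonary disease": ["COPD", "chronic obstructive lung disease"],
    "vascular endothelial growth factor": ["VEGF"],
}


def build_index(thesaurus):
    """Map every lower-cased term (normalized or synonym) to its normalized term."""
    index = {}
    for normalized, synonyms in thesaurus.items():
        index[normalized.lower()] = normalized
        for synonym in synonyms:
            index[synonym.lower()] = normalized
    return index


def expand_query(term, thesaurus, index):
    """Return the normalized term followed by its synonyms.

    Unknown terms are returned unchanged as a single-element list.
    """
    normalized = index.get(term.lower())
    if normalized is None:
        return [term]
    return [normalized] + list(thesaurus[normalized])


INDEX = build_index(THESAURUS)
print(expand_query("copd", THESAURUS, INDEX))
```

A search front end could then combine the returned terms with OR, so that a query for an abbreviation also retrieves documents that use the full name.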


Application Areas for the Terminology Hub

Coherent mapping between Terminologies and Coding Systems (e.g. UniProt Accession Number for a Protein)

Coherent mapping between internal knowledge repositories (e.g. Biological Assays and Chemical Compounds)

Coherent mapping between external knowledge repositories (e.g. HUGO and OMIM)

Coherent mapping between internal and external knowledge repositories (e.g. Internal Project Code and Product Name)

Ultralink makes use of both the terminologies (entity recognition) and the Terminology Hub (cross-referencing)


UltraLink

[Screenshot with callouts: Activation Ultralink (Ultralink Plug-in icon); Activation Concept Types Frame]


The Landscape of Knowledge - Rooting the Ultralink in Data Sources/Terminologies

The Ultralink makes use of a broad range of knowledge sources, both internal to Novartis and external. The linkage of these terminologies provides the routes along which you can navigate when using the Ultralink.

The linkage between the resources is created automatically via a rule-based mapping procedure and manually by annotation. The latter is extremely important for connecting internal knowledge sources together and to external ones.

The annotations built on the fly by the UltraLink could be stored as RDF annotations associated with a document and be accessed by other computer programs, in the spirit of the Semantic Web.
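
To make the idea of annotations built on the fly more concrete, the following Python sketch marks up a text with entries from a small lexicon. It is a simplified dictionary lookup under stated assumptions, not the UltraLink implementation: the `LEXICON` contents and the `annotate` function are hypothetical, and the concept types are borrowed from the classes named in this presentation.

```python
import re

# Hypothetical lexicon: lower-cased surface form -> (normalized term, concept type).
LEXICON = {
    "glivec": ("Glivec", "Products"),
    "ccr5": ("CCR5", "Genes"),
    "ovarian cancer": ("ovarian cancer", "Diseases"),
}


def annotate(text, lexicon):
    """Return (start, end, normalized term, concept type) for every lexicon match."""
    # Try longer surface forms first so multi-word terms win over their parts.
    forms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(form) for form in forms) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.end(), *lexicon[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


for annotation in annotate("Glivec was tested in ovarian cancer.", LEXICON):
    print(annotation)
```

Each resulting record carries character offsets plus a controlled term, which is the kind of information that could subsequently be serialized as RDF statements about the document.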


The Landscape of Knowledge - Rooting the Ultralink in Data Sources/Terminologies

[Diagram with two parts: Concepts and Terminology; Concepts and Data]


Underlying terminologies used at NIBR

• > 15’000 Companies with > 35’000 terms

• > 2’000 Diseases with > 19’000 terms

• > 150’000 Genes with about 400’000 terms

• > 5’000 Modes of Action with > 12’000 terms

• > 95’000 Products with > 380’000 terms

• > 170’000 Targets with > 250’000 terms

• > 310’000 Species with > 435’000 terms

• + complete MESH and EMTREE

• More than 1’600’000 terms

• The terminology consists of terms and relations between terms (main entries, i.e. normalized terms, plus synonyms, broader terms and narrower terms)


Principles used for the construction of the terminology and organization of terms

In order to create the terminology of reference, terms are extracted from available terminologies (e.g. UniProt, EntrezGene, HGNC, etc.) and the references to the source systems are preserved.

Terms specific to a database are referred to as local terms. These local terms are stored in a dedicated data structure, the Metastore. Besides the flat set of terms, thesaurus relations such as synonymy, broader term and narrower term are extracted as well, which makes it possible to build a thesaurus.

For each entry in the terminology (e.g. a gene name or a product), one term is chosen from the list of synonyms and declared the “normalized term”.

Normalized / global terms, synonyms / local terms as well as broader and narrower terms together with their sources of reference constitute the terminology content behind the UltraLink and are used by the Terminology Hub.
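
The organization described above (one normalized term per entry, local terms that keep a reference to their source, and broader/narrower relations) can be sketched as a small data structure. This is only an illustration: the class names `LocalTerm` and `Entry`, their fields, and the sample assignment of terms to sources are assumptions and do not reflect the actual Metastore schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LocalTerm:
    """A term as it appears in one source system."""
    text: str
    source: str  # name of the source system the term was extracted from


@dataclass
class Entry:
    """One terminology entry with its normalized term and thesaurus relations."""
    normalized: str
    local_terms: list = field(default_factory=list)
    broader: list = field(default_factory=list)
    narrower: list = field(default_factory=list)

    def add_local_term(self, text, source):
        self.local_terms.append(LocalTerm(text, source))

    def synonyms(self):
        """Distinct surface forms other than the normalized term, in insertion order."""
        seen = {self.normalized}
        result = []
        for local_term in self.local_terms:
            if local_term.text not in seen:
                seen.add(local_term.text)
                result.append(local_term.text)
        return result

    def sources_of(self, text):
        """Sorted names of all sources in which the given surface form occurs."""
        return sorted({lt.source for lt in self.local_terms if lt.text == text})


# Illustrative data only: which source lists which name is assumed here.
entry = Entry(normalized="CCR5", broader=["chemokine receptor"])
entry.add_local_term("CCR5", "HGNC")
entry.add_local_term("CCR5", "EntrezGene")
entry.add_local_term("CKR5", "EntrezGene")
entry.add_local_term("C-C chemokine receptor type 5", "UniProt")

print(entry.synonyms())
print(entry.sources_of("CCR5"))
```

Keeping the source with every local term is what preserves the reference to the source system when several terminologies contribute the same name.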


Creating Reference – the Terminology Hub

Different knowledge repositories have different ways to encode a concept:

• Registry Number

• Unique Internal ID

• Concept Identifier

• Enumerating terms

• Just using different terms without any constraints

Searching for a term T in both source A and source B may lead to different results because of different naming/referencing conventions (false negatives in IR).

The Terminology Hub ensures coherent mapping

• Between coding systems

• Between different representation levels (e.g. ID vs. Concept)

• Between local terms and global terms

More than 8 GB of cross-referencing information
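
The mapping role of the Terminology Hub can be pictured with a toy cross-reference store in Python. This is a sketch under simplifying assumptions (one code per system and global term, everything in memory); the class `TerminologyHub`, its methods, and the source names and codes in the usage lines are invented placeholders, not real identifiers or the actual hub design.

```python
from collections import defaultdict


class TerminologyHub:
    """Toy cross-reference store linking (system, code) references to a global term."""

    def __init__(self):
        self._term_by_ref = {}                  # (system, code) -> global term
        self._refs_by_term = defaultdict(dict)  # global term -> {system: code}

    def link(self, system, code, global_term):
        """Register that `code` in `system` refers to `global_term`."""
        self._term_by_ref[(system, code)] = global_term
        self._refs_by_term[global_term][system] = code

    def global_term(self, system, code):
        """Resolve a local reference to its global term, or None if unknown."""
        return self._term_by_ref.get((system, code))

    def translate(self, system, code, target_system):
        """Map a code from one system to another via the shared global term."""
        term = self.global_term(system, code)
        if term is None:
            return None
        return self._refs_by_term[term].get(target_system)


hub = TerminologyHub()
# Placeholder source names and codes, for illustration only.
hub.link("SourceA", "A-0042", "CCR5")
hub.link("SourceB", "gene/ccr5", "CCR5")

print(hub.translate("SourceA", "A-0042", "SourceB"))
```

With such a mapping in place, a search for term T can be rewritten into the identifier each source actually uses, which addresses the false negatives mentioned above.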


Classes of objects covered by the Terminology Hub

Coding systems

• A coding system provides a predefined set of (sometimes hierarchical) codes to represent a classification, a nomenclature, a controlled vocabulary, a thesaurus or chemical structures. For example, you can use the MeSH® Tree number C06.405.205.697 to refer to Gastritis in a specific sub-tree of MeSH®.

References

• Unique and unequivocal identifiers based on a coding system create references in their corresponding data repository. By nature, they are technical artifacts and not part of our scientific natural language (e.g. FTY720); nevertheless, most of them deserve to be identified because they are used in scientific literature.

Pointers and cross-referencing information

• The Metastore contains pointers that allow knowledge sources and applications to be cross-referenced.


Classes of objects covered by the Terminology Hub

Terms

• A term is the smallest meaningful linguistic unit on which our domains of discourse (biology, chemistry, medicine) are based. A term differs from a word because a term can consist of multiple meaningful words, such as “chronic obstructive pulmonary disease”.

Concepts

• A concept is an abstraction based on properties of individuals that we observe in the world. Individuals that belong to the same concept share a set of common properties. For example, “targets” share the property that they should be druggable.

Data Repositories, also named Knowledge Sources

• For all kinds of different data, we use the general notion of a data repository. Using the term “data repository”, we emphasize the fact that there is a source where some data resides, without making any commitments about physical representation (e.g. database or text file) or format of representation (e.g. structured or free text).


Classes of objects covered by the Terminology Hub

[Diagram of object classes with examples, connected by labelled relations]

Terms: spinal cord, vascular endothelial growth factor, CCR5, Glivec, ovarian cancer, Novartis, Cytomegalovirus, ...

Concepts: Species, Products, Companies, Diseases, Genes, Targets, Mammalian Genes, ...

Reference: Compound nos, Project codes, Competitor codes, PMID 9683255, EntrezGene 450128, CAS 439-14-5, Patent numbers

Encoding: IUPAC, Structures, IDs, GIF, Symbols, Formulas, Registry Numbers, ...

Data Repositories: Internal Chemistry DB, CI sources, Literature, Patents, ...

Relation labels: has-type, encodes, points-to, synonym-of, broader, narrower, is-a
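
The classes and relation labels of this slide lend themselves to a representation as subject-predicate-object statements. The Python sketch below shows one plausible reading; which relation connects which pair of classes is an assumption here, since only the labels themselves are recoverable from the slide, and the helper `match` is hypothetical.

```python
# Each statement is a (subject, predicate, object) triple.
# The pairing of relation labels with classes is assumed for illustration.
TRIPLES = [
    ("Glivec", "has-type", "Products"),
    ("CCR5", "has-type", "Genes"),
    ("ovarian cancer", "has-type", "Diseases"),
    ("Mammalian Genes", "is-a", "Genes"),
    ("VEGF", "synonym-of", "vascular endothelial growth factor"),
    ("PMID 9683255", "points-to", "Literature"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return all triples that agree with every argument that is not None."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


print(match(TRIPLES, predicate="has-type"))
print(match(TRIPLES, subject="Mammalian Genes"))
```

Leaving any of the three positions open turns the same function into a simple pattern query, which is the basic access pattern of a triple store.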


Achievements and Improvements

All information about terminologies and cross-references is stored in a relational database (Oracle 10.2.0.2).

The data in the database can be accessed through Web Services, allowing users to find normalized terms, pointers for a specific concept type, etc.


Metastore Web Service: Get all synonyms for a normalized form


UltraLink Web Services: Get all accessible pointer types for a normalized form


Achievements and Improvements

We intend to improve the semantic representation of the data in order to facilitate reuse, interoperability and exchange.

RDF notation and RDF coding standards provide an adequate means for a richer semantic representation.

We use SKOS, Dublin Core and other RDF-based coding standards and supplement them with our own RDF vocabulary.


Simple Knowledge Organisation System (example)


Terminology for Diseases (SKOS fragment)


Converting Terminologies to RDF

Clear separation of terminologies from ontologies. We assign a type (rdf:type) to the URI of a term as a reference to a concept in an ontology.

Conversion to RDF increased the amount of data roughly by a factor of 3.

We obtained more than 5 million RDF triplets as a preliminary representation of our terminologies.

We are currently setting up the entire workflow for generation, storing and querying RDF.
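
The rdf:type assignment mentioned on this slide can be sketched in Python as the generation of N-Triples lines. The `rdf:type` URI is the standard one; the two namespaces and the function `type_triple` are hypothetical and chosen only to show the term URI and the ontology concept living in separate namespaces.

```python
from urllib.parse import quote

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
TERM_NS = "http://example.org/terminology/"  # hypothetical namespace for terms
ONTO_NS = "http://example.org/ontology/"     # hypothetical namespace for concepts


def type_triple(term, concept):
    """Return one N-Triples line typing a term URI with an ontology concept URI."""
    term_uri = TERM_NS + quote(term, safe="")
    concept_uri = ONTO_NS + quote(concept, safe="")
    return f"<{term_uri}> <{RDF_TYPE}> <{concept_uri}> ."


for term, concept in [("Glivec", "Products"), ("ovarian cancer", "Diseases")]:
    print(type_triple(term, concept))
```

Because the ontology is only referenced by URI, the terminology can be regenerated or extended without touching the ontology itself.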


Conclusion

The first phase of transforming the terminology to RDF/XML is completed.

We are currently developing a model for representing the Terminology Hub in RDF. We expect that an RDF notation of the Terminology Hub will comprise approximately 50 million RDF triples.

We intend to test the framework thoroughly (performance, effective semantic gain compared to the current technology).

Closer collaboration with the W3C Healthcare group


Acknowledgements

Semantic & Text Analytics Layer: Martin Romacker, Pierre Parisot, Nicolas Grandjean

Data Integration & Services Layer: Alexander Fromm, Laurent Mentek

Application Layer: Daniel Cronenberger, Olivier Kreim

Thanks to Manuel Peitsch

Thanks to the ULT team