ECOR European Centre for Ontological Research Basic Introduction to Ontology-based Language Technology (LT) for the Biomedical Sciences (1st year Biomedicine, UG, Belgium) Werner Ceusters European Centre for Ontological Research Universität des Saarlandes Saarbrücken, Germany


Page 1

ECOR: European Centre for Ontological Research

Basic Introduction to Ontology-based Language Technology (LT) for the Biomedical Sciences

(1st year Biomedicine, UG, Belgium)

Werner Ceusters
European Centre for Ontological Research
Universität des Saarlandes
Saarbrücken, Germany

Page 2

Purpose of this lecture

• Introduce some keywords

• Give just a taste for ontology-based LT in Biomedicine

• Induce interest for further research

Page 3

Biomedicine: A Great Area for LT

• Educated users

• High utility of NLP

• Doesn’t require solution to general problem

• Complex and interesting (not just IE)

• Recent surge in data

• Knowledge bases available

Hinrich Schütze, Novation Biosciences; Russ Altman, Stanford University

Page 4

Biomedical Data Mining and DNA Analysis

• DNA sequences: 4 basic building blocks (nucleotides): adenine (A), cytosine (C), guanine (G), and thymine (T).

• Gene: a sequence of hundreds of individual nucleotides arranged in a particular order

• Humans were estimated to have around 100,000 genes (later estimates are closer to 20,000 protein-coding genes)

• Tremendous number of ways that the nucleotides can be ordered and sequenced to form distinct genes

• Semantic integration of heterogeneous, distributed genome databases
– Current: highly distributed, uncontrolled generation and use of a wide variety of DNA data
– Data cleaning and data integration methods developed in data mining will help

Jiawei Han and Micheline Kamber
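To give a feel for the "tremendous number of ways" nucleotides can be ordered, here is a minimal Python sketch. It only assumes that a sequence of length n over the four-letter alphabet A, C, G, T has 4^n possible orderings; the function name `count_sequences` is illustrative and not from the slides.

```python
from itertools import product

NUCLEOTIDES = "ACGT"

def count_sequences(length: int) -> int:
    """Number of distinct nucleotide sequences of the given length (4 ** length)."""
    return len(NUCLEOTIDES) ** length

# Enumerate all sequences of length 2 to check the formula on a small case.
dimers = ["".join(p) for p in product(NUCLEOTIDES, repeat=2)]
print(len(dimers), count_sequences(2))

# Even a short stretch of 300 nucleotides allows an astronomically large number of orderings.
print(count_sequences(300))
```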

Page 5

DNA Analysis: Examples

• Similarity search and comparison among DNA sequences
– Compare the frequently occurring patterns of each class (e.g., diseased and healthy)
– Identify gene sequence patterns that play roles in various diseases

• Association analysis: identification of co-occurring gene sequences
– Most diseases are not triggered by a single gene but by a combination of genes acting together
– Association analysis may help determine the kinds of genes that are likely to co-occur together in target samples

• Path analysis: linking genes to different disease development stages
– Different genes may become active at different stages of the disease
– Develop pharmaceutical interventions that target the different stages separately

• Visualization tools and genetic data analysis

Jiawei Han and Micheline Kamber
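The association analysis idea can be sketched in a few lines of Python: count how often pairs of genes turn up together in the same sample. This is only a toy illustration. The gene names and samples are hypothetical, and real association mining would add support and confidence thresholds.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(samples):
    """Count how often each unordered pair of genes appears in the same sample."""
    counts = Counter()
    for genes in samples:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(set(genes)), 2):
            counts[pair] += 1
    return counts

# Hypothetical samples: each set lists the genes found active in one sample.
samples = [
    {"geneA", "geneB", "geneC"},
    {"geneA", "geneB"},
    {"geneB", "geneC"},
    {"geneA", "geneB", "geneD"},
]
pair_counts = cooccurrence_counts(samples)
print(pair_counts.most_common(3))
```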

Page 6

Task descriptions

• Sequence similarity searching
– Nucleic acid vs nucleic acid: 28
– Protein vs protein: 39
– Translated nucleic acid vs protein: 6
– Unspecified sequence type: 29
– Search for non-coding DNA: 9
• Functional motif searching: 35
• Sequence retrieval: 27
• Multiple sequence alignment: 21
• Restriction mapping: 19
• Secondary and tertiary structure prediction: 14
• Other DNA analysis including translation: 14
• Primer design: 12
• ORF analysis: 11
• Literature searching: 10
• Phylogenetic analysis: 9
• Protein analysis: 10
• Sequence assembly: 8
• Location of expression: 7
• Miscellaneous: 7
• Total: 315

Stevens R, Goble C, Baker P, and Brass A. A Classification of Tasks in Bioinformatics. Bioinformatics 2001: 17 (2):180-188.

Page 7

Three major challenges

• Analyse massive amounts of data:
– E.g.: high-throughput technologies based upon cDNA or oligonucleotide microarrays for analysis of gene expression, analysis of sequence polymorphisms and mutations, and sequencing

• Appropriately link clinical histories to molecular or other biomarker data generated by genomic and proteomic technologies.

• Development of user-friendly computer-based platforms
– that can be accessed and utilized by the average researcher for searching, retrieval, manipulation, and analysis of information from large-scale datasets

Page 8

BUT !!!

• Majority of data buried in
– huge amounts of texts
– incompatibly annotated databases

Page 9

Text overload

– According to a conservative estimate, the number of digital libraries is more than 10^5. [Norbert Fuhr 03]

– Google indexed over 4.28 billion web pages (from Google press release).

– But any single engine is prevented from indexing more than one-third of the “indexable web” (from Science, Vol. 285, Nr. 5426).

Page 10

Objectives of LT in Biomedical Informatics

• Make large volumes of scientific texts more accessible

• Assist annotation of genome and phenome to allow better linking of the data
– CSB: Computational Systems Biology

• Link biomedical data with patient record data

Page 11

Knowledge discovery and use

Page 12

Text Mining Technologies for Biomedicine

[Figure: technologies placed on two axes, utility and cost effectiveness, each running from low to high: Artificial Intelligence (Cyc), Information Extraction (Fastus), Primary Literature Reading, Keyword-based Retrieval (PubMed), Structure Mining, Manual Knowledge Representation (Riboweb)]

Hinrich Schütze, Novation Biosciences; Russ Altman, Stanford University

Page 13

Scientists in areas such as molecular biology and biochemistry aim to discover new biological entities and their functions. Typical cases could be discoveries of the implications of new proteins and genes in an already known process, or implication of proteins with previously characterized functions in a separate process.

The use of available information (published papers, etc.) is a key step for the discovery process, since in many cases weak or indirect evidences about possible relations hidden in the literature are used to substantiate working hypothesis that are experimentally explored.

[C.Blaschke, A.Valencia: 2001]

Page 14

Text-based knowledge discovery

• Goal: finding “new” biomedical scientific knowledge by combining existing knowledge as represented in the medical literature

• Motivation: avoid re-inventing the wheel, and reuse specific knowledge outside the original domain of discovery

Page 15

Swanson

[Diagram: Substance A (fish oil) → Effects B (high blood viscosity, platelet aggregation) → Disease C (Raynaud’s disease)]
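Swanson's A-B-C pattern can be illustrated with a small Python sketch: if the literature links A to B and B to C, but never A to C directly, then A-C is a candidate hypothesis. This is a toy version under simplifying assumptions. The relation list is hand-written from the slide, and `abc_hypotheses` is a hypothetical helper, not Swanson's actual procedure.

```python
from collections import defaultdict

def abc_hypotheses(relations):
    """Return {(A, C): [B, ...]} for A-C pairs linked only through some intermediate B."""
    targets = defaultdict(set)
    for source, target in relations:
        targets[source].add(target)
    direct = set(relations)
    hypotheses = defaultdict(list)
    for a, bs in targets.items():
        for b in sorted(bs):
            for c in sorted(targets.get(b, ())):
                # Keep only indirect links: A and C are never reported together.
                if c != a and (a, c) not in direct:
                    hypotheses[(a, c)].append(b)
    return dict(hypotheses)

# Relations as they might be reported in separate papers (simplified).
relations = [
    ("fish oil", "high blood viscosity"),
    ("fish oil", "platelet aggregation"),
    ("high blood viscosity", "Raynaud's disease"),
    ("platelet aggregation", "Raynaud's disease"),
]
print(abc_hypotheses(relations))
```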

Page 16

Protein-Protein Interaction extracted from texts (by C. Blaschke)

Page 17

Steps of Knowledge Discovery

• Training data gathering

• Feature generation
– k-grams, domain know-how, ...

• Feature selection
– Entropy, χ², CFS, t-test, domain know-how, ...

• Feature integration (some classifiers/learning methods)
– SVM, ANN, PCL, CART, C4.5, kNN, ...

Limsoon Wong
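The first steps above can be sketched in Python: generate k-gram features from sequences, then score one feature by how much it reduces class entropy (information gain). This is a minimal sketch. The training examples are made up, the function names are illustrative, and the final feature integration step (a classifier such as SVM or kNN) is left out.

```python
import math
from collections import Counter

def kgrams(sequence, k):
    """All overlapping substrings of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(examples, feature):
    """Entropy reduction from splitting (sequence, label) examples on k-gram presence."""
    k = len(feature)
    labels = [label for _, label in examples]
    present = [label for seq, label in examples if feature in kgrams(seq, k)]
    absent = [label for seq, label in examples if feature not in kgrams(seq, k)]
    remainder = sum(
        len(part) / len(labels) * entropy(part) for part in (present, absent) if part
    )
    return entropy(labels) - remainder

# Hypothetical training data: (sequence, class label).
examples = [("ACGTAC", "pos"), ("TTACGA", "pos"), ("GGGTTT", "neg"), ("TTTGGC", "neg")]
print(kgrams("ACGTAC", 3))
print(information_gain(examples, "ACG"))  # separates the two classes perfectly
print(information_gain(examples, "TT"))   # a much weaker feature
```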

Page 18

Functional components for text-based feature generation system

• Basic use components (end-user):
– Corpus Management tool
– Parser
– Export module

• Management components:
– Corpus editor (super user)
– Grammar building workbench (super user)
– Domain Ontology editor (super user)
– Parser generator (exporter)
– Linguistic ontology, multi-lingual use (exporter)

Page 19

What does it take to build such a system?

• Short term: single domain
– Corpus collection & analysis
– Domain model design & implementation
– Grammar Development
– Corpus Manipulation Engine
– Integration in Biomining package

• Long term: generic system
– Grammar Building Workbench
– Parser Generator
– Documentation

Page 20

A “statistics only system”

22 page full paper

ABSTRACT ONLY

Page 21

Relative Concept/Node identification (real)

[Plot: relative identification of concepts and nodes (y-axis, 0 to 0.4) against number of words (x-axis, 0 to 5000)]

Statistical analysis is powerful, but not enough

Page 22

Clean separation of knowledge for deep understanding

The Galen view:

– linguistic knowledge

– conceptual knowledge

– pragmatic knowledge

– criteria knowledge

– terminological knowledge

The LT view:

– phonologic knowledge

– morphologic knowledge

– syntactic knowledge

– semantic knowledge

– pragmatic knowledge

– world knowledge

Page 23

One word – multiple meanings

• Abbreviation Extraction (Schwartz 2003)
– Extracts short and long form pairs

Short form: AA
Long forms: Alcoholic Anonymous; American; Americans; Arachidonic acid; arachidonic acid; amino acid; amino acids; anaemia; anemia; ...
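How short and long form pairs can be pulled out of text is sketched below in Python. It follows the general idea of matching the letters of a parenthesised short form right to left against the words in front of it. This is a simplified sketch in that spirit, not the published implementation. The function names, the 2 to 10 character limit on short forms, and the example sentence are assumptions, and the validity checks of a full system are omitted.

```python
import re

def best_long_form(short, candidate):
    """Match the short form's characters right to left inside the candidate text."""
    s, l = len(short) - 1, len(candidate) - 1
    while s >= 0:
        ch = short[s].lower()
        if not ch.isalnum():
            s -= 1
            continue
        # Move left until the character matches; the first character of the
        # short form must also sit at the start of a word.
        while l >= 0 and (candidate[l].lower() != ch
                          or (s == 0 and l > 0 and candidate[l - 1].isalnum())):
            l -= 1
        if l < 0:
            return None
        l -= 1
        s -= 1
    start = candidate.rfind(" ", 0, l + 1) + 1
    return candidate[start:]

PAREN = re.compile(r"\(([^()]{2,10})\)")  # assumed length limit for short forms

def extract_pairs(text):
    """Return (short form, long form) pairs for 'long form (SF)' patterns."""
    pairs = []
    for match in PAREN.finditer(text):
        short = match.group(1).strip()
        if not short or not short[0].isalnum():
            continue
        words = text[:match.start()].split()
        # Search window: at most min(|SF| + 5, 2 * |SF|) words before the parenthesis.
        window = min(len(short) + 5, len(short) * 2)
        long_form = best_long_form(short, " ".join(words[-window:]))
        if long_form:
            pairs.append((short, long_form))
    return pairs

text = "Levels of arachidonic acid (AA) rose, while each amino acid (AA) was unchanged."
print(extract_pairs(text))
```

The same short form comes back with two different long forms, which is exactly the ambiguity the slide points at.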

Page 24

Syntactic variant detection

• Corpus
– MEDLINE: the largest collection of abstracts in the biomedical domain

• Rule learning
– 83,142 abstracts
– Obtained rules: 14,158

• Evaluation
– 18,930 abstracts
– Count the occurrences of each generated variant.

[Tsuruoka et al., SIGIR 03]

Page 25

Results: “antiinflammatory effect”

Generation probability | Generated variant | Frequency
1.0 (input) | antiinflammatory effect | 7
0.462 | anti-inflammatory effect | 33
0.393 | antiinflammatory effects | 6
0.356 | Antiinflammatory effect | 0
0.286 | antiinflammatory-effect | 0
0.181 | anti-inflammatory effects | 23
... | ... | ...

Page 26

Results: “tumour necrosis factor alpha”

Generation probability | Generated variant | Frequency
1.0 (input) | tumour necrosis factor alpha | 15
0.492 | tumor necrosis factor alpha | 126
0.356 | tumour necrosis factor-alpha | 30
0.235 | Tumour necrosis factor alpha | 2
0.175 | tumor necrosis factor alpha | 182
0.115 | Tumor necrosis factor alpha | 8
... | ... | ...
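The idea behind the two result tables, generating variants by applying rewrite rules and multiplying their probabilities, can be sketched as follows. The three rules and their probabilities are hand-written and hypothetical. In the cited work the rules are learned from MEDLINE abstracts, which this sketch does not attempt.

```python
import re

# Hypothetical rewrite rules: (function, probability that the rule applies).
RULES = [
    (lambda t: re.sub(r"\banti(?=[a-z])", "anti-", t), 0.46),   # insert a hyphen after "anti"
    (lambda t: t if t.endswith("s") else t + "s", 0.39),        # pluralise
    (lambda t: t[:1].upper() + t[1:], 0.36),                    # capitalise first letter
]

def generate_variants(term, rules, min_prob=0.1):
    """Apply rules repeatedly, multiplying probabilities, and keep the best score per variant."""
    best = {term: 1.0}
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for apply_rule, prob in rules:
            variant = apply_rule(current)
            p = best[current] * prob
            # Prune unlikely variants and skip ones already reached with a better score.
            if variant != current and p >= min_prob and p > best.get(variant, 0.0):
                best[variant] = p
                frontier.append(variant)
    return sorted(best.items(), key=lambda item: -item[1])

variants = generate_variants("antiinflammatory effect", RULES)
for variant, probability in variants:
    print(f"{probability:.3f}  {variant}")
```

Each generated variant would then be looked up in the corpus to count its frequency, as in the evaluation step of the slides.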

Page 27

Biomedical NE Task (Collier Coling00, Kazama ACL02, Kim ISMB02)

• Recognize “names” in the text: identify and classify
– Technical terms expressing proteins, genes, cells, etc.

Thus, CIITA not only activates the expression of class II genes but recruits another B cell-specific coactivator to increase transcriptional activity of class II promoters in B cells.

[Entity labels shown on the slide: DNA, PROTEIN, DNA, CELLTYPE]

Junichi Tsujii
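The "identify and classify" task can be illustrated with a deliberately naive Python sketch that looks up known names in a small lexicon. This is a swapped-in technique for illustration only, since the cited systems rely on statistical learning rather than plain lookup. The lexicon and its label assignments are assumptions.

```python
import re

# Hypothetical mini-lexicon mapping surface forms to entity classes.
LEXICON = {
    "CIITA": "PROTEIN",
    "class II genes": "DNA",
    "class II promoters": "DNA",
    "B cells": "CELLTYPE",
}

def tag_entities(text, lexicon):
    """Return (start, end, surface form, label) for each lexicon entry found in the text."""
    # Longer entries first, so the longest possible name wins at each position.
    alternatives = "|".join(re.escape(t) for t in sorted(lexicon, key=len, reverse=True))
    pattern = re.compile(r"\b(?:" + alternatives + r")\b")
    return [(m.start(), m.end(), m.group(), lexicon[m.group()]) for m in pattern.finditer(text)]

sentence = ("Thus, CIITA not only activates the expression of class II genes "
            "but recruits another B cell-specific coactivator to increase "
            "transcriptional activity of class II promoters in B cells.")
for start, end, surface, label in tag_entities(sentence, LEXICON):
    print(f"{label:9s} {surface!r} at {start}-{end}")
```

A lookup like this misses every name that is not already in the lexicon, such as "B cell-specific coactivator" here, which is one reason learned models are used in practice.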

Page 28

Text mining and classification

[Diagram: concept graph connecting “Having a healthcare phenomenon”, “Generalised Possession”, “Healthcare phenomenon”, “Human”, “Patient”, “Cancer patient”, “Malignant neoplasm” and “lung carcinoma” through the relations IS-A, Has-possessor, Has-possessed, Is-possessor-of and Has-Healthcare-phenomenon, with numbered steps 1 to 3]

Mr. Smith has a pulmonary carcinoma
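A rough Python sketch of how an IS-A hierarchy supports this kind of classification: the sentence mentions "pulmonary carcinoma", which maps to the concept lung carcinoma, which IS-A malignant neoplasm, so the patient can be classified as a cancer patient. The exact links below are assumptions, since the diagram's structure is only partly recoverable, and the synonym table and function names are hypothetical.

```python
# Assumed IS-A links (child -> parent), loosely based on the concepts named on the slide.
IS_A = {
    "lung carcinoma": "malignant neoplasm",
    "malignant neoplasm": "healthcare phenomenon",
    "cancer patient": "patient",
    "patient": "human",
}

# Assumed mapping from surface terms to concepts.
SYNONYMS = {"pulmonary carcinoma": "lung carcinoma"}

def ancestors(concept):
    """Follow IS-A links upwards and return all more general concepts."""
    chain = []
    while concept in IS_A:
        concept = IS_A[concept]
        chain.append(concept)
    return chain

def is_a(term, candidate):
    """True if the term's concept equals the candidate or is subsumed by it."""
    concept = SYNONYMS.get(term, term)
    return candidate == concept or candidate in ancestors(concept)

def classify_patient(finding):
    """A patient who has a malignant neoplasm is classified as a cancer patient."""
    return "cancer patient" if is_a(finding, "malignant neoplasm") else "patient"

print(classify_patient("pulmonary carcinoma"))
```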

Page 29

Data integration approaches

• Protein interaction databases

• Small molecule databases

• Genome databases

• Pathway databases

• Protein databases

• Enzyme databases

GeneOntology

at least, the beginnings of ...

Page 30

Page 31

Data Integration approaches

1. Data Warehousing: Data from various data sources are converted, merged and stored in a centralized DBMS. (Example: Integrated Genomic Database)

2. Hyperlinking approaches: Links are set up between related information and data sources. (Examples: SRS, Entrez (NCBI))

3. Standardization: Efforts which address the need for a common metadata model for various application domains.

4. Integration systems: Systems that can gather and integrate information from multiple sources. Some of these systems have a Mediator-Wrapper Architecture; others are language-based systems like Bio-Kleisli.

5. Federated Database: Cooperating, yet autonomous, databases map their individual schemas to a single global schema. Operations are performed against the federated schema.

Steve Brady

System Integration approaches
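The mediator-wrapper architecture mentioned under integration systems can be sketched in Python. Each wrapper hides the native format of one source and returns records in a shared shape, and the mediator simply forwards a query to every wrapper and merges the answers. All class names, record layouts and data below are hypothetical.

```python
class ProteinDbWrapper:
    """Wraps a source whose records are dicts with 'symbol' and 'description' keys."""

    def __init__(self, records):
        self._records = records

    def query(self, gene):
        return [
            {"gene": r["symbol"], "source": "protein_db", "info": r["description"]}
            for r in self._records if r["symbol"] == gene
        ]


class PathwayDbWrapper:
    """Wraps a source whose records are (gene, pathway) tuples."""

    def __init__(self, rows):
        self._rows = rows

    def query(self, gene):
        return [
            {"gene": g, "source": "pathway_db", "info": pathway}
            for g, pathway in self._rows if g == gene
        ]


class Mediator:
    """Sends one query to every wrapper and merges the results."""

    def __init__(self, wrappers):
        self._wrappers = wrappers

    def query(self, gene):
        results = []
        for wrapper in self._wrappers:
            results.extend(wrapper.query(gene))
        return results


# Illustrative data only.
mediator = Mediator([
    ProteinDbWrapper([{"symbol": "TNF", "description": "tumor necrosis factor"}]),
    PathwayDbWrapper([("TNF", "apoptosis signalling"), ("TP53", "cell cycle arrest")]),
])
hits = mediator.query("TNF")
for hit in hits:
    print(hit)
```

The caller never sees the difference between the two source formats, which is the point of the wrappers.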

Page 32

CoMeDIAS (France)

Page 33

GenesTrace™: Biological Knowledge Discovery via Structured Terminology

Page 34

The XML misconception

<?XML version="1.0" ?>
<?XML:stylesheet type="text/XSL" href="cr-radio.xsl" ?>
<CR-RADIOLOGIE>
  <ENTETE>
    <INFORMATION-SERVICE>
      <HOPITAL>Groupe hospitalier Léonard Devintscie</HOPITAL>
      <SERVICE>Radiologie Centrale</SERVICE>
      <MEDECIN>Dr. Bouaud</MEDECIN>
      <TITRE-EXAMEN>Phlébographie des membres inférieurs</TITRE-EXAMEN>
    </INFORMATION-SERVICE>
    <INFORMATION-DEMANDE>
      <SERVICE>Sce Pr. Charlet</SERVICE>
      <MEDECIN>Dr. Brunie</MEDECIN>
      <DATE>29-10-99</DATE>
    </INFORMATION-DEMANDE>
    <INFORMATION-PATIENT ID="236784020">
      <NOM>Donald</NOM>
      <PRENOM>Duck</PRENOM>
    </INFORMATION-PATIENT>
  </ENTETE>
  <BODY>
    <INDICATION>Suspicion de phlébite de jambe gauche</INDICATION>
    <TECHNIQUE>Ponction bilatérale d’une veine du dos du pied et injection de 180cc de produit de contraste</TECHNIQUE>
    <RESULTATS>image lacunaire endoluminale visible au niveau des veines péronières gauche. Absence d’opacification des veines tibiales antérieures et postérieures gauches. Les veines illiaques et la veine cave inférieure sont libres. </RESULTATS>
    <CONCLUSION>Trombophlébite péronière et probablement tibiale antérieure et postérieure gauche.</CONCLUSION>
  </BODY>
</CR-RADIOLOGIE>
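One way to read the "misconception" is that XML tags give a document structure but do not make its meaning machine-readable. The Python sketch below, using a shortened excerpt of the report and the standard `xml.etree.ElementTree` parser, shows what a program actually gets: the parser can locate the CONCLUSION element, but its content is still a plain string. The excerpt and the helper `mentions` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened excerpt of the radiology report, kept as spelled in the original.
REPORT = """<CR-RADIOLOGIE>
  <BODY>
    <INDICATION>Suspicion de phlébite de jambe gauche</INDICATION>
    <CONCLUSION>Trombophlébite péronière et probablement tibiale antérieure et postérieure gauche.</CONCLUSION>
  </BODY>
</CR-RADIOLOGIE>"""

root = ET.fromstring(REPORT)

# The markup tells us WHERE the conclusion is ...
conclusion = root.findtext("BODY/CONCLUSION")
print(conclusion)

def mentions(element_text, term):
    """Naive string search: the only 'understanding' plain markup offers."""
    return term.lower() in element_text.lower()

# ... but not WHAT it means: a search for the correctly spelled term
# fails on the spelling variant used in the report.
print(mentions(conclusion, "thrombophlébite"))
```

Linking such free text to concepts takes the language technology and ontology discussed in the rest of the lecture, not more tags.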

Page 35

Towards Machine Readable Semantics

[Table: aspects of a document, each with its type definition, what the data is about, and example cases
Form: StyleTypeDefinition; Layout; Bold, Centred, Align Left, Blink
Structure: DocumentTypeDefinition; Outline; Title, Paragraph, Heading1, Play
Meaning: InformationTypeDefinition; Content; Subject, isPartOf, Date, After_value
Function: KnowledgeTypeDefinition; Behaviour; Utility, affectedBy, Receive, Protect
Usage: WorkflowTypeDefinition; Process; Actor, Receival, Maintenance, Archival
Row labels: Data about, Formalism, Cases (Static, Dynamic), Standard]

Hao Ding, Ingeborg T. Sølvberg

Page 36

Triadic models of meaning: The Semiotic/Semantic triangle

Sign: Language / Term / Symbol

Referent: Reality / Object

Reference: Concept / Sense / Model / View

Page 37

There is ontology and “ontology”

• Ontology in Information Science:
– “An ontology is a description (like a formal specification of a program) of the concepts and relationships that can exist for an agent or a community of agents.”

• Ontology in Philosophy:
– “Ontology is the science of what is, of the kinds and structures of objects, properties, events, processes and relations in every area of reality.”

[Two diagrams, each labelled: concept, term, referent, definition]

Page 38

Why are concepts not enough?

• Why must our theory also address the referents in reality?
– Because referents are observable fixed points in relation to which we can work out how the concepts used by different communities relate to each other;
– Because only by looking at referents can we establish the degree to which concepts are good for their purpose.

Page 39

Or you get nonsense: Definition of “cancer gene”

Page 40

Take home message: Language Technology requires a clean separation of knowledge AND (the right sort of) ontology

Conceptual knowledge: the knowledge of sensible domain concepts

Knowledge of definitions and criteria: how to determine if a concept applies to a particular instance

Surface linguistic knowledge: how to express the concepts in any given language

Knowledge of classification and coding systems: how an expression has been classified by such a system

Pragmatic knowledge: what users usually say or think, what they consider important, how to integrate in software

Ontology: what exists and how what exists relates to each other