Ontology Learning From Text? Robert Stevens BioHealth Informatics Group School of Computer Science University of Manchester [email protected]

Ontology learning from text


Page 1: Ontology learning from text

Ontology Learning From Text?

Robert Stevens

BioHealth Informatics Group

School of Computer Science

University of Manchester

[email protected]

Page 2: Ontology learning from text

Introduction

• Can we use ontology learning to build ontologies?

• Not text-mining research, but ontology research

• What is ontology learning from text?

• The questions we posed

• The experiment we performed

• The results we obtained

• The conclusions we made

Page 3: Ontology learning from text

Ontology learning

• Text2Onto: http://ontoware.org/projects/text2onto/

• “The erythrocytes are the blood cells that carry oxygen to other cells in the body”

• “Lymphocytes, leukocytes, monocytes, phagocytes and granulocytes are all kinds of white blood cell”

• “These experiments show that the individual hemopoietic stem cell is a multipotent cell and can give rise to the complete range of blood cell types, both myeloid and lymphoid, as well as new stem cells like itself.”

Page 4: Ontology learning from text

Ontology Learning

Blood Cell

Erythrocyte

White Blood Cell

Monocyte

Leukocyte

Lymphocyte

Phagocyte

Granulocyte

Multipotent Stem Cell

Hemopoietic Stem Cell

arise from

Page 5: Ontology learning from text

Text to Ontology “Workflow”

Corpus

Tokenising / Sentence splitting

Part-Of-Speech (POS) tagging

Lemmatizing / Stemming

JAPE transducer annotates corpus

Text2Onto algorithms for extracting modeling primitives

Text2Onto meta-ontology

Promotion to OWL ontology

Page 6: Ontology learning from text

Extracting Patterns from Text

Sentence:

“CFU-S is a blood stem cell”

Part of Speech (POS) Tagging:

CFU-S[NNP] is[VBZ] a[DT] blood[NN] stem[NN] cell[NN]

Key: NN = noun; DT = determiner; NNP = proper noun; VBZ = verb, third-person singular present.

Pseudo JAPE rule:

Any series of nouns (A) followed by the string “ is a ” followed by a series of nouns (B)

Ontological assertions:

A and B are concepts, A is a subclass of B
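The pseudo JAPE rule can be illustrated with a minimal Python sketch. This is not JAPE and not Text2Onto code; the helper names `parse_tagged` and `extract_is_a` are hypothetical, and the input is assumed to use the bracketed tag notation shown on the slide. The rule keys on the literal words “is a”, so the tag on the verb is not inspected.

```python
import re

# Matches the slide's notation: word[TAG]
TOKEN = re.compile(r"(\S+)\[([A-Z]+)\]")


def parse_tagged(text):
    """Turn 'word[TAG] word[TAG] ...' into a list of (word, tag) pairs."""
    return TOKEN.findall(text)


def is_noun(tag):
    # NN, NNS, NNP and NNPS all count as nouns for this rule.
    return tag.startswith("NN")


def extract_is_a(tagged):
    """Return (A, B) if the tokens contain: nouns+ 'is' 'a' nouns+, else None."""
    words = [word for word, _ in tagged]
    for i in range(1, len(tagged) - 2):
        if words[i].lower() == "is" and words[i + 1].lower() == "a":
            # Walk left over the run of nouns that forms A.
            start = i
            while start > 0 and is_noun(tagged[start - 1][1]):
                start -= 1
            # Walk right over the run of nouns that forms B.
            end = i + 2
            while end < len(tagged) and is_noun(tagged[end][1]):
                end += 1
            if start < i and end > i + 2:
                return " ".join(words[start:i]), " ".join(words[i + 2:end])
    return None


tagged = parse_tagged("CFU-S[NNP] is[VBZ] a[DT] blood[NN] stem[NN] cell[NN]")
print(extract_is_a(tagged))  # ('CFU-S', 'blood stem cell')
```

The returned pair reads as “A is a subclass of B”, which is the assertion the rule is meant to produce.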

Page 7: Ontology learning from text

Text2Onto meta-ontology

Page 8: Ontology learning from text

Some Text2Onto Instances

• Instance: Astrocyte_c

– TypeOf: Concept that

– Fact: confidence VALUE 1.0

• Instance: AstrocyteNerveCell

– TypeOf: Subclass that

– Fact: domain VALUE NerveCell and

– Fact: range VALUE Astrocyte and

– Fact: confidence VALUE 1.0

Page 9: Ontology learning from text

The Questions We Asked

• Can we press the button and get a good ontology?

• If not, can we get something useful?

• Can we do it without having to write too many rules?

• Does the end-point act as a donor or recipient ontology?

Page 10: Ontology learning from text

Strategy

• Collect corpus

• Manually mark up text for cells: definitive list of terms

• Process corpus through Text2Onto (T2O)

• Analyse output of T2O for recall and precision of terms and hierarchy

• Iteration of previous two steps with variants in rules

• Evaluation against CTO gold standard

Page 11: Ontology learning from text

The Experimental Conditions

• Default T2O

• T2O plus cell specific JAPE rules and all algorithms

• Only cell specific JAPE rules, /EntropyExtraction algorithm and some “hierarchy spotting” based on term composition

• As the third condition, but with VerticalRelationsConceptClassification to include our simple JAPE rules

• As the fourth condition, but with WordConceptClassification for additional hierarchy

Page 12: Ontology learning from text

Rules for Extracting Cell Types

• Words ending in ‘cyte’, ‘blast’, ‘cell’, ‘glia’, ‘glium’, ‘cell type’, ‘cell line’ and ‘cell lineage’ (together with their plurals)

• Zero or more adjectives followed by zero or more nouns or proper nouns followed by a ‘cell word’ (together with its plural), e.g. ‘Renshaw cell’, ‘Muller cell’, ‘immature blood cell’, etc.

• Any stem cell term is a stem cell

• Any term ending with ‘progenitor cell’ is a Progenitor Cell.

• Any term ending with ‘precursor cell’ is a Precursor Cell.

• Any term ending in ‘blast’ is a Blast Cell.

• Any term ending with ‘cyte’ or ‘cell’ is a Differentiated Cell.
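As a rough illustration, the suffix rules above can be sketched in plain Python. This is not the JAPE grammar used in the experiment; the function names are hypothetical, plurals are handled by naively stripping a trailing “s”, and the order in which the rules are tried (most specific first) is an assumption, since the slide does not state a precedence.

```python
# 'Cell words' listed on the slide.
CELL_SUFFIXES = (
    "cyte", "blast", "cell", "glia", "glium",
    "cell type", "cell line", "cell lineage",
)


def normalise(term):
    """Lower-case the term and strip a simple plural 's' (naive assumption)."""
    t = term.lower().strip()
    if t.endswith("s") and not t.endswith("ss"):
        t = t[:-1]
    return t


def is_cell_term(term):
    """True if the term ends in one of the slide's 'cell words'."""
    return normalise(term).endswith(CELL_SUFFIXES)


def parent_class(term):
    """Pick a parent class from the term's ending, most specific rule first."""
    t = normalise(term)
    if t.endswith("stem cell"):
        return "Stem Cell"
    if t.endswith("progenitor cell"):
        return "Progenitor Cell"
    if t.endswith("precursor cell"):
        return "Precursor Cell"
    if t.endswith("blast"):
        return "Blast Cell"
    if t.endswith(("cyte", "cell")):
        return "Differentiated Cell"
    return None  # e.g. 'microglia': a cell term, but no hierarchy rule applies


for term in ["hemopoietic stem cell", "myeloblast", "Lymphocytes", "microglia"]:
    print(term, "->", is_cell_term(term), parent_class(term))
```

A check based purely on endings accepts any phrase that happens to end in a cell word, which is one way false positives can arise.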

Page 13: Ontology learning from text

Evaluation Strategy

• Extraction performance

• Ontology evaluation

• Domain coverage

• Expert evaluation

Page 14: Ontology learning from text

Term Recognition

• 1,277 terms in our definitive list

• 16,384 terms from whole corpus; 625 relevant

• Increase to 17,851 terms; 916 relevant

• All 118 CTO terms in corpus recalled

• Corpus has anatomical bias

• Simple rules exploit regularity of language

• Many false positives from adjective noun rule
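The counts on this slide can be turned into precision and recall with a short calculation. The sketch below rests on an assumption the slide does not confirm: that a “relevant” term is an extracted term that also appears in the 1,277-term definitive list. The run labels and the function name are hypothetical.

```python
def precision_recall(extracted, relevant, reference):
    """Return (precision, recall) for one extraction run.

    precision = relevant / extracted  (how much of the output is useful)
    recall    = relevant / reference  (how much of the definitive list was found)
    """
    return relevant / extracted, relevant / reference


DEFINITIVE_TERMS = 1277  # size of the manually built list (from the slide)

# (terms extracted, terms judged relevant), as reported on the slide.
runs = {
    "first run": (16384, 625),
    "later run": (17851, 916),
}

for name, (extracted, relevant) in runs.items():
    p, r = precision_recall(extracted, relevant, DEFINITIVE_TERMS)
    print(f"{name}: precision={p:.3f} recall={r:.3f}")
```

Under that assumption, precision stays low in both runs because most extracted terms are not cell terms, while recall rises as more relevant terms are found.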

Page 15: Ontology learning from text

Cell Terms

• Morphology: Stellate cell; columnar cell

• Ploidy: Tetraploid cell; multiploid cell

• Maturity

• Potentiality: Totipotent stem cell; multipotent cell

• Lineage

• Species origin: Animal cell; human cell

• Anatomical location

• Developmental stage: Mitotic cell; S-phase cell

• Lineage: Mesoderm cell

Page 16: Ontology learning from text

Common errors

| Manually extracted from corpus | Automatically extracted from corpus | Comments |
|---|---|---|
| | +t - cell | Symbols not handled very well |
| | contains cell | False-positive cell type |
| | Foam cell | New cell type extracted |
| leukocyte | leucocyte | Spelling errors in corpus |
| naïve cell | nave cell | Character encoding problem |
| Spermatogonia | | No rule to extract |

Page 17: Ontology learning from text

Term Recall and Precision

Page 18: Ontology learning from text

Default learnt ontology

Page 19: Ontology learning from text

Final learnt ontology

Still not perfect!

Page 20: Ontology learning from text

Ontology evaluation

Page 21: Ontology learning from text

Learnt Ontology under CTO

Page 22: Ontology learning from text

Discussion

• Exploiting poor performance to focus learning

• Exploiting regularity of language

• Never really going to find CTO domain-general layer

• Terms highly compositional and conflate axes

• Ask the question “is it useful?” not “is it good?”

• Is CTO a good standard?

• The extracted hierarchy was not bad from a cell biology and ontological point of view

Page 23: Ontology learning from text

Nascent Methodology

• Form corpus that includes, but is not limited to, scope of target ontology

• Extract terms from corpus

• Filter and massage list of terms to find those of ontological interest

• Use ontology learning to see what happens

• Inspect and augment rules to recognise and incorporate into hierarchy

• Iterate

• Use as donor ontology to transfer useful bits to recipient ontology

Page 24: Ontology learning from text

Conclusions

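• Can we press the button and get a good ontology? No

• If not, can we get something useful? Yes

• Can we do it without having to write too many rules? Yes

• Donor or recipient ontology? Donor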

Page 25: Ontology learning from text

Acknowledgements

• Simon Jupp has done the work

• Jaclyn Bibby for the MSc project prototype

• Johanna Volker for help with Text2Onto

• David Shotton for knowledge about cell biology