Biosemantics group Martijn Schuemie

Overview

The biosemantics group

Ontology assembly

Concept tagging

Homonym disambiguation

Concept profile creation

Nucleolus

Biosemantics group

ErasmusMC University Medical Center Rotterdam

Department of Medical Informatics

Biosemantics group

Jan Kors

Barend Mons

Erik van Mulligen

Martijn Schuemie

Rob Jelier

Kristina Hettne

Antoinne van Veldhoven

Biosemantics group

Biosemantics

Molecular Biology

High-throughput experiment data (genomics and proteomics)

Gene and protein databases, MEDLINE, Gene Ontology

Biosemantics

Concept-based text-mining

Interpretation of experiment data

Knowledge discovery

Ontology assembly

Entrez Gene Swiss-Prot HUGO

Combination

Add spelling variations
ABC1 -> ABC-1
DEF3 -> DEF-III

Remove highly ambiguous terms
CO2, membrane-bound, obesity, open reading frame

P=37%, R=76%

P=50%, R=75%
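
The spelling-variation step can be sketched as below. This is a minimal illustration, not the group's actual rules: the helper names `to_roman` and `spelling_variants`, and the choice to generate both a hyphenated and a Roman-numeral form, are assumptions based only on the two examples on the slide.

```python
import re

# Numeral table for small suffix numbers (valid for 1..39 only).
ROMAN = [(10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]


def to_roman(n):
    """Convert an integer in 1..39 to a Roman numeral (hypothetical helper)."""
    out = []
    for value, numeral in ROMAN:
        while n >= value:
            out.append(numeral)
            n -= value
    return "".join(out)


def spelling_variants(symbol):
    """Return the symbol plus hyphenated and Roman-numeral variants.

    Only symbols of the form letters+digits (e.g. ABC1) are expanded;
    everything else is returned unchanged.
    """
    variants = {symbol}
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", symbol)
    if match:
        stem, digits = match.groups()
        variants.add(f"{stem}-{digits}")            # ABC1 -> ABC-1
        number = int(digits)
        if 0 < number < 40:                         # keep to_roman in its valid range
            variants.add(f"{stem}-{to_roman(number)}")  # DEF3 -> DEF-III
    return variants


print(sorted(spelling_variants("DEF3")))
```

A real thesaurus would need more rules (Greek letters, spacing, case), and the generated variants should be filtered for ambiguity afterwards.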

Concept tagging

MEDLINE text Malaria fever is a disease. It is spread by mosquitos.

Sentence splitting [Malaria fever is a disease.] [It is spread by mosquitos.]

Tokenization [Malaria] [fever] [is] [a] [disease]

Word normalisation [malaria] [fever] [be] [a] [disease]

Concept mapping [malaria fever] C24530 [disease] C12634

Homonym disambiguation PSA -> Prostate Specific Antigen or Poultry Science Association?

Concept profile of text
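
The first four steps of this pipeline can be sketched as follows. The concept IDs are the ones shown on the slide; the tiny thesaurus, the lemma table and the greedy longest-match strategy are illustrative assumptions, not a description of the group's tagger.

```python
import re

# Toy thesaurus keyed by normalised token tuples (IDs taken from the slide).
THESAURUS = {("malaria", "fever"): "C24530", ("disease",): "C12634"}
# Assumed, minimal normalisation table.
LEMMAS = {"is": "be", "are": "be", "was": "be"}
MAX_TERM_LENGTH = max(len(term) for term in THESAURUS)


def split_sentences(text):
    """Split on whitespace that follows sentence-final punctuation."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Keep runs of letters, digits and hyphens; drop punctuation."""
    return re.findall(r"[A-Za-z0-9-]+", sentence)


def normalise(tokens):
    """Lower-case each token and replace it by its lemma if one is known."""
    return [LEMMAS.get(t.lower(), t.lower()) for t in tokens]


def map_concepts(tokens):
    """Greedy longest-match lookup of token n-grams in the thesaurus."""
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_TERM_LENGTH, len(tokens) - i), 0, -1):
            term = tuple(tokens[i:i + n])
            if term in THESAURUS:
                found.append((" ".join(term), THESAURUS[term]))
                i += n
                break
        else:
            i += 1  # no term starts at this token
    return found


def tag(text):
    """Run the four steps and return the concepts found per sentence."""
    return [map_concepts(normalise(tokenize(s))) for s in split_sentences(text)]


print(tag("Malaria fever is a disease. It is spread by mosquitos."))
```

Matching the longest term first is what lets "malaria fever" map to a single concept instead of two separate ones.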

Homonym disambiguation

Some simple rules:

• Is it likely that a term has multiple meanings?
  - 3-letter acronym (e.g. PSA): highly likely
  - long forms (e.g. Prostate Specific Antigen): highly unlikely
  - terms that refer to several concepts by definition
• Is a synonym found? (e.g. “KLK3 (PSA)”)
• Is a keyword found? (e.g. “PSA is secreted by the prostate”)

These simple rules change performance from P=50%, R=75% to P=71%, R=71%.
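
One way to read these rules as code is sketched below. The function names, the `multi_concept` flag and the plain substring matching are assumptions made for illustration; the slide does not say how the rules were implemented.

```python
import re


def is_likely_ambiguous(term, multi_concept=False):
    """Rule 1: 3-letter acronyms, and terms flagged as referring to
    several concepts by definition, are treated as likely homonyms."""
    return multi_concept or re.fullmatch(r"[A-Z]{3}", term) is not None


def accept_mapping(term, text, synonyms, keywords, multi_concept=False):
    """Accept an ambiguous term only if a synonym or keyword of the
    intended concept also occurs in the text (rules 2 and 3)."""
    if not is_likely_ambiguous(term, multi_concept):
        return True  # long forms are accepted as they are
    lowered = text.lower()
    has_synonym = any(s.lower() in lowered for s in synonyms)
    has_keyword = any(k.lower() in lowered for k in keywords)
    return has_synonym or has_keyword


# Hypothetical evidence lists for the concept Prostate Specific Antigen.
SYNONYMS = ["KLK3"]
KEYWORDS = ["prostate"]

print(accept_mapping("PSA", "KLK3 (PSA) levels were measured", SYNONYMS, KEYWORDS))
print(accept_mapping("PSA", "The PSA held its annual meeting", SYNONYMS, KEYWORDS))
```

Rejecting unsupported acronyms trades recall for precision, which is consistent with the change the slide reports (precision up from 50% to 71%, recall down from 75% to 71%).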

Homonym disambiguation

Concept profile of text containing PSA

Concept profile of Prostate Specific Antigen

Concept profile of Phosphoserine Aminotransferase

Unknown meaning

Similarity?

Previous tests showed an overall accuracy of 93%
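
The comparison on this slide can be sketched as picking the candidate meaning whose concept profile is most similar to the profile of the text. Cosine similarity, the threshold value and all weights below are assumptions for illustration; the slide does not name the similarity measure.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse profiles (concept -> weight)."""
    dot = sum(w * b.get(c, 0.0) for c, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def disambiguate(text_profile, candidate_profiles, threshold=0.1):
    """Return the best-matching meaning, or None for 'unknown meaning'
    when no candidate reaches the (assumed) similarity threshold."""
    best_name, best_score = None, 0.0
    for name, profile in candidate_profiles.items():
        score = cosine(text_profile, profile)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


# Made-up profiles; real ones would hold thousands of weighted concepts.
candidates = {
    "Prostate Specific Antigen": {"prostate": 0.8, "KLK3": 0.4},
    "Phosphoserine Aminotransferase": {"serine": 0.9, "biosynthesis": 0.3},
}
print(disambiguate({"prostate": 1.0, "neoplasm": 0.5}, candidates))
```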

Concept profile creation

[Diagram: Concept -> Texts -> Concept profiles of texts -> Concept profile of concept]

Texts are linked to a concept:
- From databases
- By concept mapping
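
The aggregation step in this diagram can be sketched as follows. Taking the plain mean of the text profiles is an assumption chosen for simplicity; the function name `concept_profile` is hypothetical, and the weighting actually applied is a separate choice.

```python
from collections import defaultdict


def concept_profile(text_profiles):
    """Average per-text profiles (concept -> weight) into one profile.

    Concepts missing from a text count as weight 0 for that text.
    """
    totals = defaultdict(float)
    for profile in text_profiles:
        for concept, weight in profile.items():
            totals[concept] += weight
    count = len(text_profiles)
    if count == 0:
        return {}
    return {concept: total / count for concept, total in totals.items()}


# Two made-up text profiles linked to the same concept.
texts = [
    {"breast neoplasm": 1.0, "Estrogen": 1.0},
    {"breast neoplasm": 1.0},
]
print(concept_profile(texts))
```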

Concept profile creation

Binary

Log likelihood

X IDF

Uncertainty coefficient

Concept profile creation

Profile of gene ESR1:

estrogen receptor 1

breast neoplasm 0.5

BRCA1 0.34

PGR 0.30

Estrogen 0.28

BRCA2 0.25

TP53 0.15

gene suppressor tumor 0.12

genetics polymorphism 0.12

genetic predisposition to disease 0.10

female 0.05

Concept profile comparison

Concept Name              Weight  RAB27B  MYRIP  MLPH  RAB27A
RAB27A                    52.17   0.61    0.74   0.73  1
MLPH                      11.16   -       0.44   1     0.29
Myosin Type V             7.22    0.04    0.68   0.4   0.22
Melanosomes               6.7     0.12    0.3    0.47  0.27
RAB27B                    4.06    1       0.14   -     0.11
MYRIP                     2.98    0.07    1      0.09  0.06
Melanocytes               2.73    0.13    0.14   0.28  0.17
Myosins                   2.33    0.04    0.38   0.22  0.12
Myosin Heavy Chains       1.72    -       0.46   0.18  0.09
GTP Phosphohydrolases     1.31    0.17    0.23   0.04  0.08
Actins                    1.17    0.05    0.32   0.12  0.06
Exocytosis                0.87    0.08    0.12   0.08  0.12
Secretory Vesicles        0.68    0.07    0.16   0.06  0.09
Carrier Proteins          0.59    -       0.11   0.17  0.09
Organelles                0.54    0.11    -      0.12  0.09
rab GTP-Binding Proteins  0.52    0.16    -      0.04  0.12

Nucleolus

• main function: ribosome biogenesis

• over 700 proteins identified and classified into 8 main categories

Nucleolus – Concept profiles

[Diagram: Protein -> MEDLINE articles (from databases) -> Concept profiles of texts -> Concept profile of protein]

Nucleolus – Concept profiles

BLAST (Basic Local Alignment Search Tool)

Query: nucleolar protein

Results: homologs in
• human
• mouse
• fruitfly
• yeast

Nucleolus – Concept profiles

                Minimum  Maximum  Mean
Homologs used
  Human         0        9        1.66
  Mouse         0        10       1.37
  Fruitfly      0        5        0.7
  Yeast         0        8        1.21
Articles used
  Articles      1        1046     91.31

Nucleolus – fun with protein profiles

• 2D visualization of high-dimensional space

• Automatic functional annotation of proteins

• Finding similar proteins

Nucleolus - visualisation

[Figure: Multi-Dimensional Scaling plot of the protein concept profiles. Legend: Function unknown, Chaperones, Chromatin structure, Fibrous proteins, mRNA metabolism, Others, Ribosomal proteins, Ribosome biogenesis, Translation. Labelled points: SRP, PARN, Exosome comp. 10, O43390, P98179, Q8N220]

Nucleolus – Assigning GO terms

[Diagram: GO term -> MEDLINE articles (from GO) -> Concept profiles of texts -> Concept profile of GO term]

Nucleolus – Assigning GO terms

AuC: Area under Curve

Category             AuC   p
Chaperones           1.00  <.001
Chromatin Structure  0.98  <.001
Fibrous proteins     0.97  <.001
mRNA metabolism      0.72  <.001
Others               0.81  <.001
Ribosomal proteins   0.97  <.001
Ribosome biogenesis  0.69  <.001
Translation          0.88  <.001
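
For readers unfamiliar with the measure, the area under the ROC curve can be computed as the fraction of (positive, negative) pairs that a score ranks in the right order. The sketch below uses made-up scores; in this setting a score would be the similarity between a protein profile and a category profile, which is an assumption about how the figures above were obtained.

```python
def area_under_curve(scores, labels):
    """AUC as the share of positive/negative pairs ranked correctly.

    scores: one similarity score per protein.
    labels: True/1 if the protein belongs to the category, else False/0.
    Ties between a positive and a negative count as half a correct pair.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up example: three of the four pairs are ordered correctly.
print(area_under_curve([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
```

An AuC of 1.00 means every member of the category scores above every non-member; 0.5 corresponds to a random ranking.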

Nucleolus – Assigning GO terms

‘Mistakes’ in automatic annotation

1. Manual assignment to one category only
   e.g. SFRS protein kinase 1 plays a role in splicing, but is also a kinase

2. Assumptions do not always hold
   • Sequence homology ≠ function homology
   • Concept co-occurrence ≠ functional relationship

3. Homonyms

Nucleolus – Finding new proteins

Concept profile of nucleolar protein

Concept profile of human protein

Concept profile of human protein

Concept profile of human protein

Nucleolus – Finding new proteins

60S ribosomal protein L3-like
Probable ATP-dependent RNA helicase DDX4
ATP-dependent RNA helicase DDX3Y
Guanine nucleotide binding protein-like 3
Importin-11 (importin beta family)
Putative Brix domain containing protein 1P
Probable ATP-dependent RNA helicase DDX20 (Gemin 3)
60S acidic ribosomal protein P0
Helicase SKI2W
ATP-dependent RNA helicase DDX39
40S ribosomal protein S20
Probable ATP-dependent RNA helicase DDX6
Probable ATP-dependent RNA helicase DDX23
Double-stranded RNA-binding protein Staufen homolog 1
ATP-dependent RNA helicase DDX25
Probable nucleolar complex protein 14
Eukaryotic initiation factor 4A-II
ATP-dependent RNA helicase DDX19B
40S ribosomal protein S3

Ribosomal protein
DEAD-box
DEAD-box
Found in nucleolus
Associated with nucleolar p.
DEAD-box
DEAD-box
DEAD-box
Found in nucleolus
DEAD-box
Ribosomal protein
DEAD-box
DEAD-box
Indirect evidence
DEAD-box
Nucleolar
DEAD-box
DEAD-box
Ribosomal protein