Literature Based Discovery. Dimitar Hristovski, dimitar.hristovski@mf.uni-lj.si. Institute of Biomedical Informatics, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia


Page 1: Literature Based Discovery

Literature Based Discovery

Dimitar Hristovski, dimitar.hristovski@mf.uni-lj.si

Institute of Biomedical Informatics, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia

Page 2: Literature Based Discovery

Let me introduce myself … Research and Development

• BS – Biomedicina Slovenica database
• Research Evaluation Decision Support System
• Medical Information Systems
  – Surgical clinics
  – Genetic laboratory
  – Biochemical laboratory
• Web User Behaviour Analysis
• Data warehousing and OLAP

Page 3: Literature Based Discovery

Motivation

• Overspecialization

• Information overload

• Large databases

• For many diseases the chromosomal region is known, but not the exact gene

Page 4: Literature Based Discovery

Background

• Literature-based discovery (Swanson):

Concept X (Disease) → Concepts Y (Pathological or Cell Function, …) → Concepts Z (Genes)

New relation between X and Z?

Page 5: Literature Based Discovery

Biomedical Discovery Support System (BITOLA)

• Goal:
  – discover potentially new relations (knowledge) between biomedical concepts
  – to be used as a research idea generator and/or as an alternative way to search Medline
• System user (researcher or intermediary):
  – interactively guides the discovery process
  – evaluates the proposed relations

Page 6: Literature Based Discovery

Extending and Enhancing Literature Based Discovery

• Goal:
  – Make literature based discovery more suitable for disease candidate gene discovery
  – Decrease the number of candidate relations
• Method:
  – Integrate background knowledge:
    • Chromosomal location of diseases and genes
    • Gene expression location
    • Disease manifestation location

Page 7: Literature Based Discovery

Usage Scenarios

• For a disease with known chromosomal location, find a candidate gene

• For a gene, find a disease that might be influenced

• For a disease and gene found to be related by linkage study, find the mechanism of the relation (intermediate concepts should help)

Page 8: Literature Based Discovery

System Overview

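[Figure: system architecture. Databases (Medline, LocusLink, HUGO, OMIM, …) feed Knowledge Extraction, which fills the Knowledge Base (Concepts, Association Rules, Background Knowledge (Chromosomal Locations, …)). The Discovery Algorithm works on the Knowledge Base and is driven through the User Interface.]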

Page 9: Literature Based Discovery

Databases

• Medline: source of known relationships between biomedical concepts

• Set of concepts:
  – MeSH (Medical Subject Headings): Controlled dictionary and thesaurus used for indexing and searching the Medline database
  – HUGO: official gene symbols, names and aliases
  – LocusLink: gene symbols, aliases and chr. locations
  – OMIM: genetic diseases
• UMLS (Unified Medical Language System)
• Entrez: used to search PubMed, GenBank, ...
• UniGene: gene expression

Page 10: Literature Based Discovery

Knowledge Extraction

• Build master set of concepts (MeSH terms and gene symbols)

• Extract occurrence of concepts from each Medline record (MeSH terms from MH field, gene symbols from Title and Abstract)

• Association rule mining (concept co-occurrence)
• Chromosomal location extraction (from LocusLink and HUGO)
• Load into knowledge base
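
The extraction steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the BITOLA implementation: the record layout (`mh`, `title`, `abstract`), the toy records and the tiny `GENE_SYMBOLS` set are all hypothetical, and gene symbols are detected by naive whole-token lookup, which is exactly the approach that produces the false positives discussed on the following slides.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical master set of gene symbols (the slides take these from HUGO and LocusLink).
GENE_SYMBOLS = {"TNF", "EGF", "L1CAM"}

# Toy records standing in for Medline citations; the field names are illustrative only.
RECORDS = [
    {"mh": ["Multiple Sclerosis", "Optic Neuritis"],
     "title": "TNF levels in optic neuritis", "abstract": "We measured TNF."},
    {"mh": ["Multiple Sclerosis"],
     "title": "EGF and TNF in demyelination", "abstract": ""},
]

def concepts_in(record):
    """Return the concepts of one record: MeSH terms plus detected gene symbols."""
    text = f"{record['title']} {record['abstract']}"
    tokens = set(re.findall(r"[A-Za-z0-9]+", text))
    return set(record["mh"]) | (tokens & GENE_SYMBOLS)

def count_cooccurrences(records):
    """Count document frequency per concept and co-occurrence per concept pair."""
    doc_freq, pair_freq = Counter(), Counter()
    for record in records:
        concepts = sorted(concepts_in(record))  # sorted so each pair has one canonical order
        doc_freq.update(concepts)
        pair_freq.update(combinations(concepts, 2))
    return doc_freq, pair_freq

doc_freq, pair_freq = count_cooccurrences(RECORDS)
```

The two counters hold everything needed to derive association rules later: `doc_freq` gives the number of docs per concept and `pair_freq` the number of docs per co-occurring pair.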

Page 11: Literature Based Discovery

Terminology Problems during Knowledge Extraction

• Gene names

• Gene symbols

• MeSH and genetic diseases

Page 12: Literature Based Discovery

Detected Gene Symbols by Frequency

• type|666548
• II|552584
• III|201776
• component|179643
• CT|175973
• AT|151337
• ATP|147357
• IV|123429
• CD4|99657
• p53|89357
• MR|88682
• SD|85889
• GH|84797
• LPS|68982
• 59|67272
• E2|64616
• 82|63521
• AMP|61862
• TNF|59343
• RA|58818
• CD8|57324
• O2|56847
• ACTH|54933
• CO2|53171
• PKC|51057
• EGF|50483
• T3|49632
• MS|46813
• A2|44896
• ER|43212
• upstream|41820
• PRL|41599

Page 13: Literature Based Discovery

Gene Symbol Disambiguation

• Find MEDLINE docs in which we can expect to find gene symbols

• JD indexing (Susanne Humphrey) as possible solution:
  – Identifies the semantic context of docs
  – If the semantic context is not genetic, then the gene symbol is probably a false positive
• Example of false positive:
  – Ethics in a twist: "Life Support", BBC1. BMJ 1999 Aug 7;319(7206):390
  – breast basic conserved 1 (BBC1) gene vs. BBC1 television station featuring new drama series Life Support

Page 14: Literature Based Discovery

JD Indexing

• JDs are 127 Journal Descriptors (e.g., JDs for journal Hum Mol Genet: Cytogenetics; Genetics, Medical)

• Training set docs (435,000) inherit JDs from journals
• Training set provides co-occurrence data between inherited JDs and:
  – indexing terms assigned to docs directly
  – words in docs
• Docs having indexing terms/words occurring often with genetics JDs in the training set are assumed to have a genetics context

• Extended to indexing by 134 UMLS semantic types (e.g. Gene or Genome, Gene Function,…)
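
The idea of inferring a genetics context from word/JD co-occurrence can be illustrated with a toy scorer. This is a simplified sketch and not Humphrey's JD indexing method: the training documents, the averaging formula and the `THRESHOLD` value are assumptions made for this example only.

```python
from collections import Counter

# Toy training set: (words in a doc, journal descriptors inherited from its journal).
TRAINING = [
    ({"gene", "mutation", "linkage"}, {"Genetics, Medical"}),
    ({"gene", "chromosome"}, {"Cytogenetics"}),
    ({"television", "drama", "ethics"}, {"Ethics, Medical"}),
    ({"ethics", "consent"}, {"Ethics, Medical"}),
]
GENETICS_JDS = {"Genetics, Medical", "Cytogenetics"}

def train(training):
    """For each word, count training docs overall and those with a genetics JD."""
    total, genetic = Counter(), Counter()
    for words, jds in training:
        for word in words:
            total[word] += 1
            if jds & GENETICS_JDS:
                genetic[word] += 1
    return total, genetic

def genetics_score(words, total, genetic):
    """Average, over known words, of the share of training docs with a genetics JD."""
    known = [w for w in words if total[w] > 0]
    if not known:
        return 0.0
    return sum(genetic[w] / total[w] for w in known) / len(known)

total, genetic = train(TRAINING)
THRESHOLD = 0.5  # arbitrary cut-off chosen for this illustration

bbc_doc = {"ethics", "television", "drama"}   # context of the BBC1 television example
gene_doc = {"gene", "mutation", "breast"}     # context in which BBC1 could be a gene
```

A symbol such as BBC1 would be kept only when `genetics_score(...)` for its document reaches `THRESHOLD`; with the toy data the television document scores 0.0 and the genetics document 1.0.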

Page 15: Literature Based Discovery

System Overview

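[Figure: system architecture. Databases (Medline, LocusLink, HUGO, OMIM, …) feed Knowledge Extraction, which fills the Knowledge Base (Concepts, Association Rules, Background Knowledge (Chromosomal Locations, …)). The Discovery Algorithm works on the Knowledge Base and is driven through the User Interface.]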

Page 16: Literature Based Discovery

Binary Association Rules

• X → Y (confidence, support)
• If X Then Y (confidence, support)
• Confidence = % of docs containing Y within the X docs
• Support = number (or %) of docs containing both X and Y
• The relation between X and Y is not known.
• Examples:
  – Multiple Sclerosis → Optic Neuritis (2.02, 117)
  – Multiple Sclerosis → Interferon-beta (5.17, 300)
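
A minimal sketch of how the two measures follow from document counts. The function name and the counts in the usage line are made up for illustration. Reading the slide's first example as (confidence in %, support in docs), a support of 117 at 2.02% confidence would imply roughly 117 / 0.0202 ≈ 5,792 docs containing Multiple Sclerosis.

```python
def association_rule(docs_with_x, docs_with_both):
    """Return (confidence in %, support as a doc count) for the rule X -> Y."""
    if docs_with_x <= 0:
        raise ValueError("X must occur in at least one document")
    confidence = 100.0 * docs_with_both / docs_with_x
    return round(confidence, 2), docs_with_both

# Made-up counts: X occurs in 5,000 docs, X and Y occur together in 100 docs.
confidence, support = association_rule(5000, 100)
```

Note that the rule is directional: Y → X has the same support but its confidence is computed against the docs containing Y.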

Page 17: Literature Based Discovery

Discovery Algorithm

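[Figure: discovery algorithm. Concept X (Disease) is linked through Concepts Y (Pathological or Cell Function, …) to Concepts Z (Genes). The chromosomal region of X is matched against the chromosomal location of Z, and the manifestation location of X against the expression location of Z. Candidate gene?]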

Page 18: Literature Based Discovery

Discovery Algorithm

• Let X be the starting concept of interest.
• Find all Y for which X → Y.
• Find all Z for which Y → Z.
• Eliminate those Z for which X → Z already exists.
• Eliminate those Z that do not match the chromosomal region of X.
• Eliminate those Z that do not match the expression location of X.
• The remaining Z are candidates for a new relation between X and Z.

In general:

X → Y1 → … → Yn → Z, but not X → Z

Example:

X = disease

Y = (patho)physiology of X

Z = (de)regulators of Y (drugs, proteins, genes)

New relation example: Z is candidate gene for disease X
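
The elimination steps can be sketched as a single filtering loop. This is an illustration under simplifying assumptions, not the BITOLA code: the `rules` mapping, the gene names and the location data are hypothetical, chromosomal matching is reduced to string equality, and expression matching to a non-empty overlap of tissue sets.

```python
def discover(x, rules, gene_location, gene_expression, x_region, x_tissues):
    """Return candidate Z concepts for the starting concept x.

    rules maps each concept to the set of concepts it is associated with.
    """
    known = rules.get(x, set())
    candidates = set()
    for y in known:
        for z in rules.get(y, set()):
            if z == x or z in known:
                continue  # X -> Z already exists, so it is not a new relation
            if gene_location.get(z) != x_region:
                continue  # chromosomal location does not match the region of X
            if not (gene_expression.get(z, set()) & x_tissues):
                continue  # not expressed where the disease manifests
            candidates.add(z)
    return candidates

# Hypothetical toy knowledge base.
rules = {
    "DiseaseX": {"Cell Movement", "GENE_K"},
    "Cell Movement": {"GENE_A", "GENE_B", "GENE_C", "GENE_K", "DiseaseX"},
    "GENE_K": {"GENE_A"},
}
gene_location = {"GENE_A": "Xq28", "GENE_B": "1p31", "GENE_C": "Xq28", "GENE_K": "Xq28"}
gene_expression = {"GENE_A": {"brain"}, "GENE_B": {"brain"},
                   "GENE_C": {"liver"}, "GENE_K": {"brain"}}

found = discover("DiseaseX", rules, gene_location, gene_expression, "Xq28", {"brain"})
```

In the toy data only GENE_A survives: GENE_B is in the wrong region, GENE_C is expressed in the wrong tissue, and GENE_K is already related to the disease.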

Page 19: Literature Based Discovery

Ranking Concepts Z

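[Figure: X is linked to intermediate concepts Y1, Y2, Y3, …, Yi, …, Yj, which are linked to candidate concepts Z1, Z2, Z3, …, Zk, …, Zn; links are labelled with weights such as s1.]

\[
\mathrm{Rank}(Z_k) = \sum_{i=1}^{m} \left( S_{XY_i} \cdot S_{Y_i Z_k} \right)
\]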
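
A small sketch of the ranking sum, assuming that S stands for the support of the corresponding association rule (the slide does not spell this out). The `support` dictionary and its values are invented for illustration.

```python
def rank_candidates(x, candidates, support):
    """Score each Z_k as the sum over linking Y_i of S(X, Y_i) * S(Y_i, Z_k)."""
    ys = {y for (a, y) in support if a == x}
    scores = {}
    for z in candidates:
        scores[z] = sum(
            support[(x, y)] * support[(y, z)]
            for y in ys
            if (y, z) in support  # only Y_i that actually link to this Z_k contribute
        )
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical supports for X -> Y_i and Y_i -> Z_k.
support = {
    ("X", "Y1"): 10, ("X", "Y2"): 4,
    ("Y1", "Z1"): 3, ("Y2", "Z1"): 5, ("Y2", "Z2"): 6,
}
ranking = rank_candidates("X", ["Z1", "Z2"], support)
```

Here Z1 scores 10*3 + 4*5 = 50 and Z2 scores 4*6 = 24, so a candidate reached through several strongly supported intermediate concepts is ranked first.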

Page 20: Literature Based Discovery

Results: Concepts in Medline

• Full Medline (end 2001) analyzed (11,226,520 recs)

• Looking for 19,781 MeSH terms and 22,252 human genes (14,659 from HUGO and 7,593 from LocusLink). 24,613 alias gene symbols added

• Gene symbols found in 2,689,958 Medline recs. The most frequent are ambiguous symbols (CT, MR, CO2,…) or format errors

Page 21: Literature Based Discovery

Results: Co-occurring Concepts in Medline

• 29,851,448 distinct pairs of co-occurring concepts:
  – In 7,106,099 pairs at least one gene symbol appeared
  – In 679,159 pairs both concepts are gene symbols

• Total co-occurrence frequency: 798,366,684

• 59,702,986 association rules calculated and stored

Page 22: Literature Based Discovery

Bilateral Perisylvian Polymicrogyria - BPP (OMIM: 300388)

• Polymicrogyria of the cerebral cortex is a developmental abnormality characterized by excessive surface convolution
• Clinical characteristics:
  – Mental retardation
  – Epilepsy
  – Pseudobulbar palsy (paralysis of the face, throat, tongue and the chewing process)
• X-linked dominant inheritance

Page 23: Literature Based Discovery
Page 24: Literature Based Discovery

BPP - pathogenesis

• It is considered a disorder of neuronal migration (unlayered type) or a consequence of intrauterine ischemia (layered type)

Page 25: Literature Based Discovery

Finding Candidate Genes for Polymicrogyria, bilateral perisylvian

Page 26: Literature Based Discovery

[Figure: candidate gene filtering for BPP. The 237 genes in Xq28 are narrowed to 18, then 15, and finally 2 gene candidates: L1CAM and FLNA. Filters shown: relation between semantic types Cell Movement and Gene or gene products; sublocalisation in the Xq28; tissue specific expression.]

Page 27: Literature Based Discovery
Page 28: Literature Based Discovery

User Interface “cgi-bin” version

Page 29: Literature Based Discovery

Automatically search for supporting Medline Citations

Page 30: Literature Based Discovery

Cleft Palate – Predicting Candidate Genes

Table 1 - The results of predicting candidate genes for Cleft Palate using filtering by chromosomal and expression location.

Genomic location (GL) | Size of GL (Mb) | No. of genes in GL | Hits after filtering by location | Hits after filtering by location / all genes in GL* | Hits after filtering by location and expression | Hits after filtering by location and expression / all genes in GL*
1p31 | 23.4 | 78 | 18 | 18/78 (0.23) | 10 | 10/78 (0.13)
1p36 | 27.6 | 327 | 63 | 63/327 (0.19) | 39 | 39/327 (0.12)
3p24 | 15.8 | 41 | 6 | 6/41 (0.15) | 3 | 3/41 (0.07)
4p16 | 11.0 | 131 | 20 | 20/131 (0.15) | 10 | 10/131 (0.08)
6p24 | 6.5 | 33 | 5 | 5/33 (0.15) | 5 | 5/33 (0.15)
16p13 | 16.7 | 250 | 47 | 47/250 (0.19) | 25 | 25/250 (0.10)
Average | 16.8 | 144 | 26.5 | 26.5/144 (0.18) | 15.4 | 15.4/144 (0.11)
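
The ratio columns of Table 1 are plain fractions of the genes in each genomic location, which a few lines of Python can reproduce. The counts are copied from the table; the function name is only illustrative.

```python
# (genomic location, genes in GL, hits after location filter, hits after location + expression)
ROWS = [
    ("1p31", 78, 18, 10),
    ("1p36", 327, 63, 39),
    ("3p24", 41, 6, 3),
    ("4p16", 131, 20, 10),
    ("6p24", 33, 5, 5),
    ("16p13", 250, 47, 25),
]

def remaining_fractions(rows):
    """Map each location to (share left after location, share left after expression)."""
    return {
        gl: (round(by_location / genes, 2), round(by_expression / genes, 2))
        for gl, genes, by_location, by_expression in rows
    }

fractions = remaining_fractions(ROWS)
```

For 1p31 this gives 0.23 and 0.13, so the expression filter removes roughly a further tenth of the region's genes beyond the location filter.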

Page 31: Literature Based Discovery
Page 32: Literature Based Discovery
Page 33: Literature Based Discovery

Summary and Conclusions

• We extend and enhance an existing discovery support system (BITOLA)

• The system can be used as:
  – Research idea generator, or
  – Alternative method of searching Medline

• Genetic knowledge about the chromosomal locations of diseases and genes is included to make BITOLA more suitable for disease candidate gene discovery

Page 34: Literature Based Discovery

Further Work

• Increase the number of concepts

• Gene symbol disambiguation

• Semantic relations extraction

• System evaluation

• Improve the Web version of the system

Page 35: Literature Based Discovery

System Availability

• URL:

www.mf.uni-lj.si/bitola/

Page 36: Literature Based Discovery

Related work: SemGen
Tom Rindflesch et al.

• Extract semantic predications on genetic basis of disease

• “Deletions of INK4 occur in malignant tumors”
  – INK4|ASSOCIATED_WITH|Malignant Tumors

• Evaluation and visualization of SemGen output

Page 37: Literature Based Discovery

Semantic Structures

<genetic phenomenon> ETIOLOGY_OF <disorder>

• CAUSE: cause, determine, result in, control, underlie, transmit, responsible
• PREDISPOSE: predispose, lead to, promote, susceptibility, risk
• ASSOCIATED_WITH: associate, involve, link, implicate, influence, related
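
The grouping of trigger words under the three predicates can be written down as a lookup table. This is a toy illustration and not how SemGen works: SemGen relies on natural language processing, whereas the sketch below only does prefix matching at word boundaries, so it misses inflected multi-word forms such as "results in" and cannot handle the slide's INK4 sentence, where the verb "occur" is not in the list. The function names are hypothetical.

```python
import re

# Trigger words per predicate, as listed on the slide.
TRIGGERS = {
    "CAUSE": ["cause", "determine", "result in", "control", "underlie",
              "transmit", "responsible"],
    "PREDISPOSE": ["predispose", "lead to", "promote", "susceptibility", "risk"],
    "ASSOCIATED_WITH": ["associate", "involve", "link", "implicate",
                        "influence", "related"],
}

def predicate_for(phrase):
    """Return the first predicate with a trigger starting at a word boundary, else None."""
    lowered = phrase.lower()
    for predicate, words in TRIGGERS.items():
        for word in words:
            if re.search(r"\b" + re.escape(word), lowered):
                return predicate
    return None

def format_predication(subject, predicate, obj):
    """Render a predication in the pipe-delimited form shown on the SemGen slide."""
    return f"{subject}|{predicate}|{obj}"

line = format_predication("INK4", "ASSOCIATED_WITH", "Malignant Tumors")
```

The word-boundary check keeps "because" from matching the CAUSE trigger "cause", which a plain substring test would get wrong.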

Page 38: Literature Based Discovery

[Figure: network drawn with Pajek linking gene symbols (for example brca1, brca2, p53, ras, vhl) to disorders (for example Carcinoma of breast, Lung Neoplasms, Neuroblastoma).]

Page 39: Literature Based Discovery

Statistical Evaluation

• Assoc. rule base divided into 2 segments: older (1990-1995) and newer (1996-1999)

• The system predicts new relations based on the older segment

• Predictions compared with actual new relations in the newer segment

Page 40: Literature Based Discovery

Summary Statistical Evaluation Results

Table 3 – Summary relationship prediction results for the AVGS and 2*AVGS constraints respectively: Precision (correctly predicted among all predicted), Recall (correctly predicted among all relationships), Better than random (correct predictions of the system divided by random correct predictions).

Disease | Precision (AVGS) | Recall (AVGS) | Better than random (AVGS) | Precision (2*AVGS) | Recall (2*AVGS) | Better than random (2*AVGS)
MS | 7.6% | 82.0% | 1.9 | 11.6% | 57.6% | 2.9
TA | 3.1% | 79.1% | 2.8 | 6.2% | 38.5% | 4.5
ML | 8.9% | 80.9% | 2.0 | 13.9% | 56.6% | 3.1
PD | 8.0% | 80.3% | 2.1 | 13.3% | 52.0% | 3.6
IP | 1.1% | 84.1% | 4.1 | 2.6% | 52.3% | 11.5
CP | 0.5% | 83.3% | 5.0 | 0.9% | 50.0% | 9.0
CMT | 3.3% | 80.2% | 4.4 | 6.5% | 50.4% | 8.3
FDH | 0.9% | 60.9% | 7.0 | 1.3% | 34.8% | 8.0
NS | 2.0% | 86.8% | 4.9 | 4.3% | 33.8% | 11.5
ED | 2.9% | 77.4% | 4.0 | 4.7% | 36.3% | 6.4
Average: | 3.8% | 79.5% | 3.8 | 6.5% | 46.2% | 6.9
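
Precision and recall as defined in the Table 3 caption can be sketched for a time-split evaluation as follows. The relation sets are invented, the function name is hypothetical, and the "better than random" baseline is left out because the slides do not say how the random predictions were generated.

```python
def evaluate(predicted, older_relations, newer_relations):
    """Compare predictions made from the older segment with the newer segment."""
    actual_new = newer_relations - older_relations  # relations that first appear later
    correct = predicted & actual_new
    precision = len(correct) / len(predicted) if predicted else 0.0
    recall = len(correct) / len(actual_new) if actual_new else 0.0
    return precision, recall

# Hypothetical (disease, gene) relations in each segment.
older = {("D", "G1")}
newer = {("D", "G1"), ("D", "G2"), ("D", "G3")}
predicted = {("D", "G2"), ("D", "G4"), ("D", "G5"), ("D", "G6")}

precision, recall = evaluate(predicted, older, newer)
```

With these sets one of four predictions is correct and one of two new relations is found, which mirrors the trade-off in the table: tighter constraints shrink the predicted set, raising precision at the cost of recall.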

Page 41: Literature Based Discovery

Statistical Evaluation Results

• With no assoc. rules constraints:
  – predicts almost all new relations, but too many candidate relations

• With constraints:
  – predicts new relations 6.9 times better than random predictions
  – the tighter the constraints, the better the (correct / all predictions) ratio (6.5%)