Literature Based Discovery
Dimitar Hristovski (dimitar.hristovski@mf.uni-lj.si)
Institute of Biomedical Informatics, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia
Let me introduce myself … Research and Development
• BS – Biomedicina Slovenica database
• Research Evaluation Decision Support System
• Medical Information Systems
– Surgical clinics
– Genetic laboratory
– Biochemical laboratory
• Web User Behaviour Analysis
• Data warehousing and OLAP
Motivation
• Overspecialization
• Information overload
• Large databases
• For many diseases the chromosomal region is known, but not the exact gene
Background
• Literature-based discovery (Swanson):
[Diagram: Concept X (Disease) → Concepts Y (Pathological or Cell Function, …) → Concepts Z (Genes); new relation between X and Z?]
Biomedical Discovery Support System (BITOLA)
• Goal:
– discover potentially new relations (knowledge) between biomedical concepts
– to be used as a research idea generator and/or as an alternative way to search Medline
• System user (researcher or intermediary):
– interactively guides the discovery process
– evaluates the proposed relations
Extending and Enhancing Literature Based Discovery
• Goal:
– Make literature based discovery more suitable for disease candidate gene discovery
– Decrease the number of candidate relations
• Method:
– Integrate background knowledge:
  • Chromosomal location of diseases and genes
  • Gene expression location
  • Disease manifestation location
Usage Scenarios
• For a disease with known chromosomal location, find a candidate gene
• For a gene, find a disease that might be influenced
• For a disease and gene found to be related by linkage study, find the mechanism of the relation (intermediate concepts should help)
System Overview
Knowledge Base
Concepts
Association Rules
Background Knowledge (Chromosomal Locations, …)
Discovery Algorithm
User Interface
Databases (Medline, LocusLink, HUGO, OMIM, …)
Knowledge Extraction
Databases
• Medline: source of known relationships between biomedical concepts
• Set of concepts:
– MeSH (Medical Subject Headings): controlled dictionary and thesaurus used for indexing and searching the Medline database
– HUGO: official gene symbols, names and aliases
– LocusLink: gene symbols, aliases and chromosomal locations
– OMIM: genetic diseases
• UMLS (Unified Medical Language System)
• Entrez: used to search PubMed, GenBank, ...
• UniGene: gene expression
Knowledge Extraction
• Build a master set of concepts (MeSH terms and gene symbols)
• Extract occurrences of concepts from each Medline record (MeSH terms from the MH field, gene symbols from the Title and Abstract)
• Association rule mining (concept co-occurrence)
• Chromosomal location extraction (from LocusLink and HUGO)
• Load into knowledge base
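The co-occurrence step above can be sketched as follows. This is a minimal illustration, not the BITOLA implementation: it assumes each Medline record has already been reduced to a set of concept names, and the function name `count_cooccurrences` and the toy records are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(records):
    """Count concept and concept-pair frequencies.

    `records` is an iterable of sets of concept names, one set per
    Medline record (an assumed, simplified input format).
    """
    concept_freq = Counter()
    pair_freq = Counter()
    for concepts in records:
        unique = sorted(set(concepts))  # sort so each pair has one canonical key
        concept_freq.update(unique)
        pair_freq.update(combinations(unique, 2))
    return concept_freq, pair_freq


# Toy records (hypothetical data, not taken from Medline)
records = [
    {"Multiple Sclerosis", "Optic Neuritis"},
    {"Multiple Sclerosis", "Interferon-beta"},
    {"Multiple Sclerosis", "Optic Neuritis", "Interferon-beta"},
]
concept_freq, pair_freq = count_cooccurrences(records)
```

The two counters are enough to derive the association rules described later: the pair count gives the support, and dividing it by the count of the rule's left-hand concept gives the confidence.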
Terminology Problems during Knowledge Extraction
• Gene names
• Gene symbols
• MeSH and genetic diseases
Detected Gene Symbols by Frequency
• type|666548
• II|552584
• III|201776
• component|179643
• CT|175973
• AT|151337
• ATP|147357
• IV|123429
• CD4|99657
• p53|89357
• MR|88682
• SD|85889
• GH|84797
• LPS|68982
• 59|67272
• E2|64616
• 82|63521
• AMP|61862
• TNF|59343
• RA|58818
• CD8|57324
• O2|56847
• ACTH|54933
• CO2|53171
• PKC|51057
• EGF|50483
• T3|49632
• MS|46813
• A2|44896
• ER|43212
• upstream|41820
• PRL|41599
Gene Symbol Disambiguation
• Find MEDLINE docs in which we can expect to find gene symbols
• JD indexing (Susanne Humphrey) as a possible solution:
– Identifies the semantic context of docs
– If the semantic context is not genetic, then the gene symbol is probably a false positive
• Example of a false positive:
– Ethics in a twist: "Life Support", BBC1. BMJ 1999 Aug 7;319(7206):390
– breast basic conserved 1 (BBC1) gene vs. the BBC1 television station featuring the new drama series Life Support
JD Indexing
• JDs are 127 Journal Descriptors (e.g., JDs for journal Hum Mol Genet: Cytogenetics; Genetics, Medical)
• Training set docs (435,000) inherit JDs from journals
• Training set provides co-occurrence data between inherited JDs and:
– indexing terms assigned to docs directly
– words in docs
• Docs having indexing terms/words that often occur with genetics JDs in the training set are assumed to have a genetics context
• Extended to indexing by 134 UMLS semantic types (e.g. Gene or Genome, Gene Function,…)
Binary Association Rules
• X → Y (confidence, support)
• If X Then Y (confidence, support)
• Confidence = % of docs containing Y within the X docs
• Support = number (or %) of docs containing both X and Y
• The relation between X and Y is not known.
• Examples:
– Multiple Sclerosis → Optic Neuritis (2.02, 117)
– Multiple Sclerosis → Interferon-beta (5.17, 300)
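The two measures defined above can be computed directly from sets of document identifiers. The sketch below is only illustrative: the function name `association_rule` and the toy document sets are assumptions, and confidence is expressed as a percentage to match the examples.

```python
def association_rule(docs_with_x, docs_with_y):
    """Return (confidence, support) for the rule X -> Y.

    docs_with_x, docs_with_y: sets of document ids containing X and Y.
    Support is the number of docs containing both concepts; confidence
    is the percentage of X docs that also contain Y.
    """
    support = len(docs_with_x & docs_with_y)
    confidence = 100.0 * support / len(docs_with_x) if docs_with_x else 0.0
    return confidence, support


# Toy example (hypothetical ids): 200 docs mention X, 10 of them also mention Y
docs_x = set(range(200))
docs_y = set(range(190, 250))
confidence, support = association_rule(docs_x, docs_y)
```

Note that the rule is directional: Y → X has the same support but is normalised by the number of Y docs, so its confidence generally differs.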
Discovery Algorithm
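[Diagram: Concept X (Disease) → Concepts Y (Pathological or Cell Function, …) → Concepts Z (Genes). The chromosomal region of X is matched against the chromosomal location of Z, and the manifestation location of X against the expression location of Z, to decide whether Z is a candidate gene.]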
Discovery Algorithm
• Let X be the starting concept of interest.
• Find all Y for which X → Y.
• Find all Z for which Y → Z.
• Eliminate those Z for which X → Z already exists.
• Eliminate those Z that do not match the chromosomal region of X.
• Eliminate those Z whose expression location does not match the manifestation location of X.
• The remaining Z are candidates for a new relation between X and Z.
In general:
X → Y1 → … → Yn → Z, but not X → Z
Example:
X = disease
Y = (patho)physiology of X
Z = (de)regulators of Y (drugs, proteins, genes)
New relation example: Z is candidate gene for disease X
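The discovery steps can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the system's code: rules are held in a plain dictionary, chromosomal matching is reduced to equality of band labels, tissue matching to a non-empty set intersection, and all names and data below are hypothetical.

```python
def find_candidates(x, rules, region_of, expressed_in, x_region, x_tissues):
    """Return candidate concepts Z for starting concept X.

    rules:        dict mapping a concept to the set of concepts it is
                  associated with (X -> Y, Y -> Z)
    region_of:    dict mapping a gene to its chromosomal band (assumed format)
    expressed_in: dict mapping a gene to the set of tissues it is expressed in
    x_region:     chromosomal region linked to X
    x_tissues:    tissues in which X manifests
    """
    known = rules.get(x, set())            # all Y with X -> Y
    candidates = set()
    for y in known:
        candidates |= rules.get(y, set())  # all Z with Y -> Z
    candidates -= known                    # drop Z already related to X
    candidates.discard(x)
    # keep only Z located in the chromosomal region of X
    candidates = {z for z in candidates if region_of.get(z) == x_region}
    # keep only Z expressed where X manifests
    candidates = {z for z in candidates if expressed_in.get(z, set()) & x_tissues}
    return candidates


# Hypothetical toy knowledge base
rules = {
    "disease_X": {"cell_movement", "GENE_K"},
    "cell_movement": {"GENE_A", "GENE_B", "GENE_C", "GENE_K"},
}
region_of = {"GENE_A": "Xq28", "GENE_B": "Xq28", "GENE_C": "1p36", "GENE_K": "Xq28"}
expressed_in = {
    "GENE_A": {"brain"},
    "GENE_B": {"liver"},
    "GENE_C": {"brain"},
    "GENE_K": {"brain"},
}
result = find_candidates("disease_X", rules, region_of, expressed_in, "Xq28", {"brain"})
```

In this toy run GENE_K is dropped because it is already related to the disease, GENE_C because it lies outside the region, and GENE_B because it is not expressed in the affected tissue.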
Ranking Concepts Z
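[Diagram: X is linked to intermediate concepts Y1, Y2, Y3, …, Yi, Yj, …, which are in turn linked to target concepts Z1, Z2, Z3, …, Zk, …, Zn]

$$\mathrm{Rank}(Z_k) = \sum_{i=1}^{m} \left( S_{XY_i} \cdot S_{Y_i Z_k} \right)$$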
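The ranking formula sums, over the intermediate concepts Yi, the product of the X–Yi score and the Yi–Zk score. A minimal sketch follows, assuming (this is not stated on the slide) that each S value is the support of the corresponding association rule; the function name `rank_targets` and the numbers are hypothetical.

```python
def rank_targets(x, s):
    """Rank target concepts Z for starting concept X.

    s maps a rule (a, b) to its score S_ab (assumed here to be the rule's
    support). Rank(Z) is the sum over all Y of S_xy * S_yz, skipping any
    Z that is already directly related to X.
    """
    scores = {}
    ys = [b for (a, b) in s if a == x]
    for y in ys:
        for (a, z), s_yz in s.items():
            if a == y and z != x and (x, z) not in s:
                scores[z] = scores.get(z, 0) + s[(x, y)] * s_yz
    # highest-ranked candidates first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical scores
s = {
    ("X", "Y1"): 10,
    ("X", "Y2"): 5,
    ("Y1", "Z1"): 3,
    ("Y2", "Z1"): 2,
    ("Y1", "Z2"): 1,
}
ranking = rank_targets("X", s)
```

Here Z1 is reached through both Y1 and Y2, so its rank is 10·3 + 5·2 = 40, while Z2 is reached only through Y1 and gets 10·1 = 10.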
Results: Concepts in Medline
• Full Medline (end 2001) analyzed (11,226,520 recs)
• Looking for 19,781 MeSH terms and 22,252 human genes (14,659 from HUGO and 7,593 from LocusLink). 24,613 alias gene symbols added
• Gene symbols found in 2,689,958 Medline recs. The most frequent matches are ambiguous symbols (CT, MR, CO2, …) or format errors
Results: Co-occurring Concepts in Medline
• 29,851,448 distinct pairs of co-occurring concepts:
– In 7,106,099 pairs at least one gene symbol appeared
– In 679,159 pairs both concepts are gene symbols
• Total co-occurrence frequency: 798,366,684
• 59,702,986 association rules calculated and stored
Bilateral Perisylvian Polymicrogyria - BPP (OMIM: 300388)
• Polymicrogyria of the cerebral cortex is a developmental abnormality characterized by excessive surface convolution
• Clinical characteristics:
– Mental retardation
– Epilepsy
– Pseudobulbar palsy (paralysis of the face, throat, tongue and the chewing process)
• X-linked dominant inheritance
• It is considered a disorder of neuronal migration (unlayered type) or a consequence of intrauterine ischemia (layered type)
BPP - pathogenesis
Finding Candidate Genes for Polymicrogyria, bilateral perisylvian
18 gene candidates
15 gene candidates
Tissue specific expression
2 gene candidates: L1CAM and FLNA
relation between semantic types Cell Movement and Gene or gene products
Sublocalisation in the Xq28
237 genes in Xq28
User Interface “cgi-bin” version
Automatically search for supporting Medline Citations
Cleft Palate – Predicting Candidate Genes
Table 1 - The results of predicting candidate genes for Cleft Palate using filtering by chromosomal and expression location.
| Genomic location (GL) | Size of GL (Mb) | No. of genes in GL | Hits after filtering by location | Hits after filtering by location / all genes in GL* | Hits after filtering by location and expression | Hits after filtering by location and expression / all genes in GL* |
|---|---|---|---|---|---|---|
| 1p31 | 23.4 | 78 | 18 | 18/78 (0.23) | 10 | 10/78 (0.13) |
| 1p36 | 27.6 | 327 | 63 | 63/327 (0.19) | 39 | 39/327 (0.12) |
| 3p24 | 15.8 | 41 | 6 | 6/41 (0.15) | 3 | 3/41 (0.07) |
| 4p16 | 11.0 | 131 | 20 | 20/131 (0.15) | 10 | 10/131 (0.08) |
| 6p24 | 6.5 | 33 | 5 | 5/33 (0.15) | 5 | 5/33 (0.15) |
| 16p13 | 16.7 | 250 | 47 | 47/250 (0.19) | 25 | 25/250 (0.10) |
| Average | 16.8 | 144 | 26.5 | 26.5/144 (0.18) | 15.4 | 15.4/144 (0.11) |
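The ratio columns of Table 1 follow directly from the counts, which makes the effect of each filter easy to check. The short sketch below recomputes them for the six genomic locations; the variable and function names are illustrative only.

```python
# (location, genes in region, hits after location filter,
#  hits after location + expression filter), copied from Table 1
rows = [
    ("1p31", 78, 18, 10),
    ("1p36", 327, 63, 39),
    ("3p24", 41, 6, 3),
    ("4p16", 131, 20, 10),
    ("6p24", 33, 5, 5),
    ("16p13", 250, 47, 25),
]


def fraction_remaining(hits, genes):
    """Share of the genes in a region that survive filtering, to 2 decimals."""
    return round(hits / genes, 2)


ratios = {
    loc: (fraction_remaining(by_loc, genes), fraction_remaining(by_expr, genes))
    for loc, genes, by_loc, by_expr in rows
}
```

For example, `ratios["1p31"]` gives the pair of fractions for the 1p31 row: roughly a quarter of the genes remain after location filtering and about an eighth after the expression filter is added.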
Summary and Conclusions
• We extend and enhance an existing discovery support system (BITOLA)
• The system can be used as:
– Research idea generator, or
– Alternative method of searching Medline
• Genetic knowledge about the chromosomal locations of diseases and genes was included to make BITOLA more suitable for disease candidate gene discovery
Further Work
• Increase the number of concepts
• Gene symbol disambiguation
• Semantic relations extraction
• System evaluation
• Improve the Web version of the system
Related work: SemGen
Tom Rindflesch et al.
• Extract semantic predications on genetic basis of disease
• “Deletions of INK4 occur in malignant tumors”
– INK4|ASSOCIATED_WITH|Malignant Tumors
• Evaluation and visualization of SemGen output
Semantic Structures
<genetic phenomenon> ETIOLOGY_OF <disorder>
• CAUSE: cause, determine, result in, control, underlie, transmit, responsible
• PREDISPOSE: predispose, lead to, promote, susceptibility, risk
• ASSOCIATED_WITH: associate, involve, link, implicate, influence, related
[Figure: Pajek network visualization with gene symbol nodes (e.g. abl, apc, brca1, brca2, p53, ras, ret, vhl, wt1) and disorder nodes (e.g. Adenomatous Polyposis Coli, Carcinoma of breast, Colorectal Cancer, Neuroblastoma, Thyroid Cancer)]
Statistical Evaluation
• Association rule base divided into 2 segments: older (1990-1995) and newer (1996-1999)
• The system predicts new relations based on the older segment
• Predictions compared with actual new relations in the newer segment
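The comparison in the last step amounts to computing precision (correctly predicted among all predicted) and recall (correctly predicted among all new relations) over two sets. A minimal sketch, with a hypothetical function name and toy data:

```python
def evaluate(predicted, actual_new):
    """Compare predictions made from the older segment with the relations
    that actually first appear in the newer segment.

    Returns (precision, recall) as fractions between 0 and 1.
    """
    correct = predicted & actual_new
    precision = len(correct) / len(predicted) if predicted else 0.0
    recall = len(correct) / len(actual_new) if actual_new else 0.0
    return precision, recall


# Hypothetical toy data: concepts predicted for one disease vs. concepts
# that became related to it in the newer segment
predicted = {"Z1", "Z2", "Z3", "Z4"}
actual_new = {"Z2", "Z4", "Z9"}
precision, recall = evaluate(predicted, actual_new)
```

With these sets, two of the four predictions are correct and two of the three new relations are found, which shows how a large candidate list can reach high recall while precision stays low.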
Summary Statistical Evaluation Results
Table 3 – Summary relationship prediction results for the AVGS and 2*AVGS constraints respectively: Precision (correctly predicted among all predicted), Recall (correctly predicted among all relationships), Better than random (correct predictions of the system divided by random correct predictions).

| Disease | Precision (AVGS) | Recall (AVGS) | Better than random (AVGS) | Precision (2*AVGS) | Recall (2*AVGS) | Better than random (2*AVGS) |
|---|---|---|---|---|---|---|
| MS | 7.6% | 82.0% | 1.9 | 11.6% | 57.6% | 2.9 |
| TA | 3.1% | 79.1% | 2.8 | 6.2% | 38.5% | 4.5 |
| ML | 8.9% | 80.9% | 2.0 | 13.9% | 56.6% | 3.1 |
| PD | 8.0% | 80.3% | 2.1 | 13.3% | 52.0% | 3.6 |
| IP | 1.1% | 84.1% | 4.1 | 2.6% | 52.3% | 11.5 |
| CP | 0.5% | 83.3% | 5.0 | 0.9% | 50.0% | 9.0 |
| CMT | 3.3% | 80.2% | 4.4 | 6.5% | 50.4% | 8.3 |
| FDH | 0.9% | 60.9% | 7.0 | 1.3% | 34.8% | 8.0 |
| NS | 2.0% | 86.8% | 4.9 | 4.3% | 33.8% | 11.5 |
| ED | 2.9% | 77.4% | 4.0 | 4.7% | 36.3% | 6.4 |
| Average | 3.8% | 79.5% | 3.8 | 6.5% | 46.2% | 6.9 |
Statistical Evaluation Results
• With no association rule constraints:
– predicts almost all new relations, but produces too many candidate relations
• With constraints:
– predicts new relations 6.9 times better than random predictions
– the tighter the constraints, the better the ratio of correct to all predictions (6.5%)