A System for Finding Biological Entities that Satisfy Certain Conditions from Texts
Wei Zhou, Clement Yu
University of Illinois at Chicago
Weiyi Meng
SUNY at Binghamton
Outline
Problem statement
Techniques and methods
Experimental results
Discussion and conclusion
CIKM 2008 By Clement Yu from UIC
Problem statement
Given a complex biological question, output relevant passages (or excerpts) where the answer can be found.
An Example
A sample question:
What [GENES] are involved in insect segmentation?
A sample relevant passage:
In all insect species examined, neural expression of hb is conserved, suggesting that a neural function is ancestral. However, as the expression of the eve and ftz genes during segmentation is not conserved between grasshopper and Drosophila, and these genes lie below gap genes such as hb in the Drosophila segmentation hierarchy, it was unclear whether the role of hb in AP patterning would be conserved in more basal insects.
Target: GENES
Qualification concepts: 1) insect 2) segmentation
[hb, ftz, and eve are targets found in the passage]
Techniques and methods
Identify concepts in queries and texts
Use of domain knowledge
Related concepts (query expansion)
Gene symbol disambiguation
Conceptual IR models
Identify concepts in queries and texts
In queries: PubMed automatic term mapping
In texts: window size. A concept is matched when all of its component words appear within a certain window size.
An example:
“...Women who are postmenopausal and who have never used hormone replacement therapy have a higher risk of colon, but not rectal, cancer than do women who ...”
[Query concept: colon cancer]
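The window-based matching of concepts in texts can be illustrated with a minimal sketch. The tokenizer, the function names, and the window size of 5 are assumptions made for illustration; the slides do not state the window size or the tokenization actually used.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def concept_in_window(text, concept, window):
    """Return True if every word of `concept` occurs inside some run of
    `window` consecutive tokens of `text`."""
    concept_words = set(tokenize(concept))
    tokens = tokenize(text)
    for start in range(len(tokens)):
        # Slide a fixed-size window over the token sequence.
        if concept_words.issubset(tokens[start:start + window]):
            return True
    return False


sentence = ("have a higher risk of colon, but not rectal, "
            "cancer than do women who")
print(concept_in_window(sentence, "colon cancer", window=5))
```

In the sample sentence, "colon" and "cancer" are separated by three other words, so the concept is found with a window of 5 tokens but not with a window of 4.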
Use of domain knowledge
Gene/protein species control (rule-based): if a query is asking for genes/proteins related to a specific species, then genes/proteins related to other species are considered irrelevant.
Example:
Query: What [GENES] are involved in axon guidance in C. elegans?
An irrelevant passage because of a different species:
“We describe DPTP52F, which is probably the last remaining RPTP encoded in the Drosophila genome. Ptp52F mutations cause specific CNS and motor axon guidance phenotypes, and exhibit genetic interactions with mutations in the other Rptp genes”.
[Ptp52F is not a relevant target because the passage is about Drosophila, not C. elegans]
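The species control rule can be sketched as a simple filter. The species dictionary, the function names, and the decision to keep passages that mention no species at all are assumptions for illustration; the slides do not describe how species mentions are detected.

```python
# Hypothetical surface forms for two species; a real system would need a
# much larger dictionary.
SPECIES_MENTIONS = {
    "C. elegans": ["c. elegans", "c.elegans", "caenorhabditis elegans"],
    "Drosophila": ["drosophila"],
}


def species_in(text):
    """Return the set of species whose surface forms occur in the text."""
    lowered = text.lower()
    return {
        name
        for name, forms in SPECIES_MENTIONS.items()
        if any(form in lowered for form in forms)
    }


def passes_species_control(query, passage):
    """Reject a passage only when the query names a species and the
    passage mentions species that do not include it."""
    wanted = species_in(query)
    found = species_in(passage)
    if not wanted or not found:
        # Assumption: with no species evidence, the passage is kept.
        return True
    return bool(wanted & found)


query = "What [GENES] are involved in axon guidance in C.elegans?"
passage = ("We describe DPTP52F, which is probably the last remaining "
           "RPTP encoded in the Drosophila genome.")
print(passes_species_control(query, passage))
```

For the example above, the query asks about C. elegans and the passage mentions only Drosophila, so the passage is rejected.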
Use of domain knowledge
Compilation of instances from thesauruses: retrieve concepts from UMLS and genes from Entrez Gene, and map them to the TREC entity types.
An example:
[Target types]: TUMOR TYPES
[Dictionary]: UMLS Metathesaurus
[Instances]: Lung Cancer; T-cell lymphoma; Pheochromocytoma
Related concepts
Synonyms
Hyponyms (one-level only)
Hypernyms (one-level only)
Lexical variants
Related abbreviations
Related concepts: lexical variants
Type 1:
Automatically generate lexical variants using manually created heuristics:
e.g., PLA2 → PLA 2, PLAII, and PLA II
Note: PLA2: Phospholipase A2
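As an illustration of Type 1, the sketch below generates the three variants listed for PLA2. The two heuristics shown (inserting a space before a trailing digit, and swapping the digit for a Roman numeral) are inferred from the example only; the actual set of manually created heuristics is not given in the slides, and the function name is hypothetical.

```python
import re

# Roman numerals for the single digits handled by this sketch.
ROMAN = {"1": "I", "2": "II", "3": "III", "4": "IV", "5": "V"}


def lexical_variants(symbol):
    """Generate variants for a symbol made of letters followed by one digit."""
    match = re.fullmatch(r"([A-Za-z]+)(\d)", symbol)
    if not match:
        return []
    letters, digit = match.groups()
    # Heuristic 1: separate the letters and the digit with a space.
    variants = [f"{letters} {digit}"]
    roman = ROMAN.get(digit)
    if roman:
        # Heuristic 2: replace the Arabic digit with a Roman numeral,
        # with and without a separating space.
        variants += [f"{letters}{roman}", f"{letters} {roman}"]
    return variants


print(lexical_variants("PLA2"))  # ['PLA 2', 'PLAII', 'PLA II']
```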
Related concepts: lexical variants
Type 2:
Retrieve additional lexical variants from a term database of MEDLINE
e.g., PLA2 → PL-A2
Note: PLA2: Phospholipase A2
Related concepts: lexical variants
Type 3:
Retrieve additional lexical variants by recognizing equivalent long-forms of an abbreviation.
6 subtypes of Type 3:
Type 3.1: Identical after stemming. Example: APC: "antigen presenting cell" ≈ "antigen presented cell"
Type 3.2: Different by a small edit distance. Example: HPV: "Human papillomavirus" ≈ "Human papillomaviral"
Type 3.3: Identical after normalization. Example: NFkb: "Nuclear factor kappa beta" ≈ "Nuclear factor kb"
Type 3.4: Different ordering. Example: Abeta: "amyloid beta protein" ≈ "beta amyloid protein"
Type 3.5: Extra words. Example: ACD: "cerebral amyloid angiopathies" ≈ "cerebral beta amyloid angiopathies"
Type 3.6: Internal abbreviations. Example: APC: "ag presenting cell" ≈ "antigen presenting cell"
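Three of the Type 3 equivalence checks (small edit distance, different ordering, extra words) can be sketched with the standard library alone. The function names and any distance threshold are assumptions; the slides do not give thresholds or implementation details, and the stemming, normalization, and internal-abbreviation subtypes are left out because they need external resources.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def same_words_any_order(a, b):
    """Different ordering: both long-forms contain exactly the same words."""
    return sorted(a.lower().split()) == sorted(b.lower().split())


def differs_by_extra_words(a, b):
    """Extra words: the shorter long-form is a word subsequence of the longer."""
    short, long_ = sorted((a.lower().split(), b.lower().split()), key=len)
    remaining = iter(long_)
    # `word in remaining` consumes the iterator, so order is respected.
    return all(word in remaining for word in short)


print(edit_distance("human papillomavirus", "human papillomaviral"))
print(same_words_any_order("amyloid beta protein", "beta amyloid protein"))
print(differs_by_extra_words("cerebral amyloid angiopathies",
                             "cerebral beta amyloid angiopathies"))
```

The first pair differs by two substitutions, so a threshold of 2 (an assumed value) would treat the two long-forms as equivalent.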
Related concepts: related abbreviations
Abbreviations whose definitions (or long-forms) contain the query concept.
For example, some related abbreviations for the concept “lung cancer” are:
SCLC (small cell lung cancer)
LCSS (lung cancer symptom scale)
NSCLC (non-small cell lung cancer)
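Selecting related abbreviations amounts to a containment test on long-forms. The sketch below assumes an abbreviation dictionary is already available; the dictionary contents and the function name are illustrative, and matching on whole words is a design choice of the sketch rather than something the slides specify.

```python
def related_abbreviations(concept, abbreviations):
    """Return abbreviations whose long-form contains the concept as a
    contiguous sequence of words."""
    concept_words = concept.lower().split()
    size = len(concept_words)
    related = []
    for abbreviation, long_form in abbreviations.items():
        # Treat hyphens as word separators so "non-small" splits cleanly.
        words = long_form.lower().replace("-", " ").split()
        if any(words[i:i + size] == concept_words
               for i in range(len(words) - size + 1)):
            related.append(abbreviation)
    return related


# Hypothetical abbreviation dictionary for illustration.
ABBREVIATIONS = {
    "SCLC": "small cell lung cancer",
    "LCSS": "lung cancer symptom scale",
    "NSCLC": "non-small cell lung cancer",
    "APC": "antigen presenting cell",
}

print(related_abbreviations("lung cancer", ABBREVIATIONS))
# ['SCLC', 'LCSS', 'NSCLC']
```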
Gene symbol disambiguation
Three simple rules are defined to disambiguate gene symbols from:
Abbreviations with non-gene meanings (Rules 1 & 2)
Example: “Here, utilizing non-obese diabetic (NOD) mice deficient for CD154 (CD154-KO/NOD), we have identified a mandatory role of CD4 T cells as the functional source of CD154 in the initiation of T1DM.” [NOD is a gene symbol, but it has a non-gene meaning here because it has a non-gene definition “non-obese diabetic”]
Common English words (Rule 3)
Example: “The Kit gene, which codes for the KIT ligand (KITL) receptor or stem cell factor, was one of the genes identified in this study.” [“Kit” is a common English word, but it has a gene meaning here because of the adjacent word “gene”]
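The two examples suggest checks along the following lines. This is a sketch inferred from the examples only, not the rules as the authors defined them: the cue words, the dictionary of non-gene long-forms, and both function names are assumptions.

```python
import re

# Assumed cue words that signal a gene meaning for an adjacent token.
GENE_CUES = {"gene", "genes", "protein", "proteins"}


def has_non_gene_definition(symbol, text, non_gene_long_forms):
    """True if the text defines `symbol` in parentheses right after a
    known non-gene long-form, e.g. 'non-obese diabetic (NOD)'."""
    for long_form in non_gene_long_forms.get(symbol, []):
        pattern = re.escape(long_form) + r"\s*\(" + re.escape(symbol) + r"\)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            return True
    return False


def common_word_is_gene(symbol, text):
    """True if some occurrence of `symbol` is directly adjacent to a cue word."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    for i, token in enumerate(tokens):
        if token.lower() == symbol.lower():
            neighbours = tokens[max(i - 1, 0):i] + tokens[i + 1:i + 2]
            if any(n.lower() in GENE_CUES for n in neighbours):
                return True
    return False


nod_text = "utilizing non-obese diabetic (NOD) mice deficient for CD154"
kit_text = "The Kit gene, which codes for the KIT ligand (KITL) receptor"
print(has_non_gene_definition("NOD", nod_text, {"NOD": ["non-obese diabetic"]}))
print(common_word_is_gene("Kit", kit_text))
```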
Conceptual IR Models
Model 1: Differentiate target instances
Model 2: Equally weight target instances
Conceptual IR Models – Model 1
Conceptual IR Models – Model 2
Experimental results
Data sets and evaluation metrics
Impact of different techniques and methods
Comparison with best reported results
Data sets and evaluation metrics
Query collection: 36 questions collected from biologists in 2007.
Document collection: 162,259 Highwire full-text documents in HTML format.
Performance metrics:
Passage MAP
Aspect MAP
Document MAP
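All three metrics are variants of mean average precision (MAP). As a reminder of the underlying computation, here is a minimal sketch of document-level MAP in its standard form. The passage and aspect variants are not reproduced, and the function names and toy rankings are illustrative assumptions, not data from the experiments.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision for one query; assumes the ranking has no duplicates."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at the rank of each relevant document.
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of the per-query average precision over (ranking, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy data: two queries with made-up document identifiers.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d5", "d6"], {"d6"}),
]
print(round(mean_average_precision(runs), 4))  # 0.6667
```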
Impact of different techniques and methods
Comparison with best reported results
Our results improve significantly on the best reported results in passage retrieval: by 22% for automatic and by 16.7% for non-automatic.
Summary
Studied five different levels of related concepts for query expansion and examined their impacts on retrieval effectiveness.
Achieved significant improvement over the best reported results
Compared two conceptual IR models in retrieval effectiveness
Evaluated a simple method for gene symbol disambiguation
Conclusions
1. Incorporating domain-specific knowledge through query expansion using multiple semantic relations significantly improved the retrieval effectiveness.
Conclusions
2. The biggest improvement comes from the lexical variants. This result also indicates that biologists are likely to use different variants of the same concept according to their own writing preferences, and that these variants might not be collected in the existing biomedical thesauruses.
Future work
Improve the quality of target instances retrieved from different resources
Improve gene symbol disambiguation method
Handle pronouns
More evaluations on other gold standards
Questions
Thanks