Application of the NLP techniques to IE and IR
CREST Language Processing Group
Outline
• Background
• Building NLP resources
  • GENIA
• Extracting Disease-Gene Associations from MEDLINE
  • H-Invitational
  • Extracting DGAs by machine learning
• An IR system for predicate-argument relations
  • MEDUSA
Application to the Biomedical domain
• Plenty of text
  • MEDLINE database: 12 million abstracts
  • Need for effective IE and IR
• Domain knowledge
  • Gene Ontology, KEGG, UMLS, ICD, …
• Other information sources
  • A variety of molecular databases: DNA sequences, motifs, diseases, molecular interactions, etc.
Developing NLP resources
• Resources for NLP research
  • Domain knowledge
  • Training data for ML-based techniques
  • Test data for evaluating the transferability of a system
• We are now developing GENIA
  • Ontology
  • Corpus
GENIA corpus
• 4,000 MEDLINE abstracts
  • Selected by MeSH terms (Human, Blood cells, Transcription factors)
  • XML format
• Contents
  • Named-entity (Kim et al 2003)
  • Part-of-speech (Tateisi et al 2004)
  • Parse tree
  • Co-reference (Institute for Infocomm Research, Singapore)
The peri-kappa B site mediates human immunodeficiency virus type 2 enhancer activation in monocytes …
GENIA named-entity corpus
• Terms are annotated based on the semantic classes in the GENIA ontology
• Size
  • 2,000 abstracts
  • Number of terms: 92,723
  • Vocabulary size: 36,568
(Figure: example sentence annotated with the semantic classes DNA, virus, and cell_type)
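A minimal sketch of reading such annotations from XML. The tag name `term`, the attribute `sem`, and the choice of which spans carry which class are illustrative assumptions, not the actual GENIA schema or gold annotation; only the sentence and the three class names come from the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: "term" and "sem" are illustrative names,
# and the span boundaries are assumed for the sake of the example.
SENTENCE_XML = (
    '<sentence>The <term sem="DNA">peri-kappa B site</term> mediates '
    '<term sem="DNA"><term sem="virus">human immunodeficiency virus type 2</term> '
    'enhancer</term> activation in <term sem="cell_type">monocytes</term></sentence>'
)

def extract_terms(xml_text):
    """Return (semantic class, surface text) for every term, nested ones included."""
    root = ET.fromstring(xml_text)
    # itertext() concatenates the text of an element and all of its descendants.
    return [(el.get("sem"), "".join(el.itertext())) for el in root.iter("term")]

terms = extract_terms(SENTENCE_XML)
for sem, text in terms:
    print(f"{sem:10s} {text}")
```

Nested elements let one term (a virus name) sit inside a longer term (an enhancer), which flat token labels cannot express directly.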
GENIA part-of-speech corpus
• Each token is annotated with its part-of-speech tag.
• Size
  • 2,000 abstracts
  • 20,544 sentences
  • 501,054 words (about half the size of Penn Treebank)
The peri-kappa B site mediates human immunodeficiency virus type 2 enhancer activation in monocytes …
DT NN NN NN VBZ JJ NN
NN NN CD NN NN IN NNS
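The token/tag alignment above can be checked with a short sketch. It assumes whitespace tokenization and that the tags are listed in sentence order; the `word/TAG` output format is a common convention, not necessarily the corpus file format.

```python
# Sentence and tag sequence copied from the slide.
tokens = ("The peri-kappa B site mediates human immunodeficiency virus "
          "type 2 enhancer activation in monocytes").split()
tags = "DT NN NN NN VBZ JJ NN NN NN CD NN NN IN NNS".split()

# Every token must receive exactly one tag.
assert len(tokens) == len(tags)

tagged = list(zip(tokens, tags))
print(" ".join(f"{word}/{tag}" for word, tag in tagged))
```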
GENIA treebank
• Based on the standard of the Penn Treebank
• Size
  • 200 abstracts (1,500 abstracts at the end of this fiscal year)
CD3-epsilon expression is controlled by a downstream T lymphocyte-specific enhancer element
(Figure: parse tree of the sentence above, with NP, ADJP, PP, VP, and S nodes)
GENIA corpus
• Used in more than 240 institutions
  • Japan (28), Asia (54), North America (63), Europe (62), etc.
• De facto standard for evaluating biomedical named-entity recognition systems
• BioNLP workshop at Coling 2004
  • Named-entity recognition shared task: Institute for Infocomm Research (Singapore), Stanford University (USA), University of Edinburgh (UK), University of Wisconsin-Madison (USA), Pohang University of Science and Technology (Korea), University of Alberta (Canada), University of Duisburg-Essen (Germany), Korea University (Korea), National Taiwan University (Taiwan), …
H-Invitational Disease Edition
(Workflow diagram; labels recovered from the figure:)
• Text-mining
• Scoring system (PANDA)
• Known disease gene
• Genomic region of interest (GROI)
• List of genes
• Genes with high score
• SNPs: 1) Public 2) Private
• Gene expression: 1) Public 2) Private
• AND/OR
• Final result
• H-InvDB, other DB
• Literature (PubMed)
• Dictionary
• Specific disease: select specific disease
• Synthetic analysis
June 25, 2004, Disease group, JBIRC
Disease-Gene Associations extracted from MEDLINE
DGA explorer
(demo)
Text
• 1.5 million MEDLINE abstracts
  • Selected by MeSH terms: “Disease Category” AND (“Amino Acids, Peptides, and Proteins” OR “Genetic Structures”)
• Parsing
  • All the sentences were parsed by the HPSG parser
  • Using a PC cluster (100 processors with GXP)
  • Time: 10 days
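The MeSH selection condition can be sketched as a simple boolean filter. This assumes each abstract has already been mapped to the set of top-level MeSH categories its headings fall under; the abstract identifiers and category assignments below are made up for illustration.

```python
REQUIRED = "Disease Category"
ANY_OF = {"Amino Acids, Peptides, and Proteins", "Genetic Structures"}

def is_selected(mesh_categories):
    """True if the abstract satisfies: REQUIRED AND (any category in ANY_OF)."""
    cats = set(mesh_categories)
    return REQUIRED in cats and bool(cats & ANY_OF)

# Hypothetical abstracts keyed by a made-up identifier.
abstracts = {
    "a1": ["Disease Category", "Genetic Structures"],
    "a2": ["Disease Category"],
    "a3": ["Amino Acids, Peptides, and Proteins"],
}

selected = [pmid for pmid, cats in abstracts.items() if is_selected(cats)]
print(selected)
```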
Disease-Gene Associations in texts
These results suggested that targeted disruption of Cyp19 caused anovulation and precocious depletion of ovarian follicles
Furthermore, AML cells with methylated p15(INK4B) tended to express higher levels of DNMT1 and 3B.
Training data
All foals with OLWS were homozygous for the Ile118Lys EDNRB mutation, and adults that were homozygous were not found.
Dominant radial drusen and Arg345Trp EFEMP1 mutation.
The 5 year overall survival (OS) and event-free survival (EFS) were 94 and 90 +/- 8%, respectively, with a median follow-up of 48 months.
These data may indicate that formation of parathyroid adenoma in young patients is related to a mechanism involving EGFR.
:
All co-occurrences are classified into “relevant” or “irrelevant” by a domain expert.
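A minimal sketch of how candidate co-occurrences could be collected before expert labelling. The toy dictionaries and plain substring matching are assumptions for illustration; the names are taken from the example sentences above, and the actual dictionary and matching method are not described on the slide.

```python
from itertools import product

# Toy dictionaries built from the example sentences (illustrative only).
GENES = ["EDNRB", "EFEMP1", "EGFR"]
DISEASES = ["OLWS", "radial drusen", "parathyroid adenoma"]

def cooccurrences(sentence):
    """Return every (gene, disease) pair whose names both occur in the sentence."""
    found_genes = [g for g in GENES if g in sentence]
    found_diseases = [d for d in DISEASES if d in sentence]
    return list(product(found_genes, found_diseases))

s1 = ("All foals with OLWS were homozygous for the Ile118Lys EDNRB mutation, "
      "and adults that were homozygous were not found.")
s2 = ("The 5 year overall survival (OS) and event-free survival (EFS) were "
      "94 and 90 +/- 8%, respectively, with a median follow-up of 48 months.")

print(cooccurrences(s1))  # one candidate pair
print(cooccurrences(s2))  # no gene or disease name, so no candidates
```

Each returned pair is only a candidate; whether it is “relevant” or “irrelevant” is what the expert annotation and the classifier decide.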
Maximum entropy learning
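• Log-linear model

$$q(x) = \frac{1}{Z} \exp\left( \sum_{i=1}^{F} \lambda_i f_i(x) \right)$$

  where $f_i$ is a feature function, $\lambda_i$ is its weight, and $Z$ is the normalization constant.
• Features
  • Bag-of-words
  • Local context
  • Gene/disease name
  • Predicate-argument structures
  • …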
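A minimal sketch of evaluating a log-linear model: each event gets a score equal to the exponential of its weighted feature sum, and the scores are normalized to sum to one. The two events, the feature values, and the weights below are made-up numbers; in practice the weights are estimated from the training data.

```python
import math

def loglinear_distribution(feature_vectors, weights):
    """Map each event x to exp(sum_i weight_i * f_i(x)) / Z."""
    scores = {
        x: math.exp(sum(w * f for w, f in zip(weights, fs)))
        for x, fs in feature_vectors.items()
    }
    z = sum(scores.values())  # normalization constant Z
    return {x: s / z for x, s in scores.items()}

# Hypothetical example: one gene-disease co-occurrence with two candidate labels.
# Feature 0 fires for the "relevant" label, feature 1 for the "irrelevant" label.
feature_vectors = {"relevant": [1.0, 0.0], "irrelevant": [0.0, 1.0]}
weights = [1.2, 0.4]  # assumed values, not learned

q = loglinear_distribution(feature_vectors, weights)
print(q)
```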
Features of predicate-argument structures (1)
Dedifferentiation of adenoid cystic carcinoma: report of a case implicating p53 gene mutation.
(Figure: a predicate X takes a gene/disease as its ARG2)
Features of predicate-argument structures (2)
These results suggested that targeted disruption of Cyp19 caused anovulation and precocious depletion of ovarian follicles.
Furthermore, AML cells with methylated p15(INK4B) tended to express higher levels of DNMT1 and 3B.
(Figure: a predicate X connects a disease/gene and a gene/disease through its ARG1 and ARG2 arguments)
Extraction accuracy
• Training/test data: 2,253 sentences
• 10-fold cross validation

features                          recall  precision  f-score
N/A                               1.0     0.351      0.520
+ bag of words                    0.733   0.682      0.706
+ local context                   0.733   0.695      0.714
+ predicate-argument structures   0.759   0.710      0.733
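As a consistency check, the f-score column can be recomputed from recall and precision, assuming the usual balanced F-measure (harmonic mean of the two). Because the published recall and precision are themselves rounded, the recomputed values can differ from the reported ones in the third decimal place.

```python
def f_score(recall, precision):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (feature set, recall, precision, reported f-score), copied from the table.
rows = [
    ("N/A", 1.0, 0.351, 0.520),
    ("+ bag of words", 0.733, 0.682, 0.706),
    ("+ local context", 0.733, 0.695, 0.714),
    ("+ predicate-argument structures", 0.759, 0.710, 0.733),
]

recomputed = [(name, f_score(r, p), reported) for name, r, p, reported in rows]
for name, value, reported in recomputed:
    print(f"{name:32s} recomputed={value:.4f} reported={reported:.3f}")
```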
MEDUSA: An IR system for predicate-argument structures
Example: search for sentences in which the subject of the verb "activate" is a protein.
• Simple: Since the PHO2 Asp-230 mutant mimics Ser-230-phosphorylated PHO2, we postulate that only phosphorylated PHO2 protein could activate the transcription of PHO5 gene.
• With a relative pronoun: Transcription initiation by the sigma(54)-RNA polymerase holoenzyme requires an enhancer-binding protein that is thought to contact sigma(54) to activate transcription.
• Coordination: Full-length Staufen protein lacking this insertion is able to associate with oskar mRNA and activate its translation, but fails to localize the RNA to the posterior.
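A minimal sketch of why predicate-argument structures help here: once each sentence is reduced to (predicate, ARG1, ARG2) records, the three surface variants above match the same query. The record layout, field names, and hand-written index are hypothetical; they are not MEDUSA's actual data model, and a real index would be derived from parser output.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PredArg:
    predicate: str    # lemma of the verb
    arg1_head: str    # head word of the subject (ARG1)
    arg1_class: str   # semantic class assigned to ARG1
    arg2_head: str    # head word of the object (ARG2)
    sentence_id: int

# Hand-written records for the three example sentences (illustrative only).
INDEX = [
    PredArg("activate", "protein", "protein", "transcription", 1),  # simple
    PredArg("activate", "protein", "protein", "transcription", 2),  # relative pronoun
    PredArg("activate", "protein", "protein", "translation", 3),    # coordination
    PredArg("require", "initiation", "other", "protein", 2),
]

def search(index, predicate, arg1_class):
    """Return ids of sentences where `predicate` has an ARG1 of the given class."""
    return sorted({r.sentence_id for r in index
                   if r.predicate == predicate and r.arg1_class == arg1_class})

hits = search(INDEX, "activate", "protein")
print(hits)
```

A keyword search for "protein" near "activate" would also match the fourth record's sentence for the wrong reason; the structured query distinguishes subject from object.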
MEDUSA demonstration
• 100,000 MEDLINE abstracts
• Parsed by Enju
• Genes and diseases are annotated by using the UMLS dictionary
Summary
• GENIA corpus
  • Parts of speech, named entities, parse trees
• Extracting gene-disease associations from MEDLINE
  • Machine learning with HPSG parse results
• An IR system for predicate-argument structures
  • MEDUSA
Software and resources
• GENIA
  • Named entity corpus
  • Part-of-speech corpus
  • Parse tree corpus
  • Co-reference (Singapore)
  • Part-of-speech tagger
  • Named entity tagger (soon)
  • HPSG parse results (100,000 MEDLINE abstracts)
• Enju (HPSG parser)
• MEDUSA
• LiLFeS
• Amis