
Application of the NLP techniques to IE and IR CREST


Page 1: Application of the NLP techniques to IE and IR CREST

Application of the NLP techniques to IE and IR

CREST Language Processing Group

Page 2: Application of the NLP techniques to IE and IR CREST

Outline

• Background
• Building NLP resources
  • GENIA
• Extracting Disease-Gene Associations from MEDLINE
  • H-invitational
  • Extracting DGAs by machine learning
• An IR system for predicate-argument relations
  • MEDUSA

Page 3: Application of the NLP techniques to IE and IR CREST

Application to the Biomedical domain

• Plenty of text
  • MEDLINE database: 12 million abstracts
  • Need for effective IE and IR
• Domain knowledge
  • Gene Ontology, KEGG, UMLS, ICD, …
• Other information sources
  • A variety of molecular databases
  • DNA sequences, motifs, diseases, molecular interactions, etc.

Page 4: Application of the NLP techniques to IE and IR CREST

Developing NLP resources

• Resources for NLP research
  • Domain knowledge
  • Training data for ML-based techniques
  • Test data for evaluating the transferability of a system
• We are now developing… GENIA
  • Ontology
  • Corpus

Page 5: Application of the NLP techniques to IE and IR CREST

GENIA corpus

• 4,000 MEDLINE abstracts
  • Selected by MeSH terms (Human, Blood cells, Transcription factors)
  • XML format
• Contents
  • Named-entity (Kim et al. 2003)
  • Part-of-speech (Tateisi et al. 2004)
  • Parse tree
  • Co-reference (Institute for Infocomm Research, Singapore)

Page 6: Application of the NLP techniques to IE and IR CREST

GENIA named-entity corpus

• Terms are annotated based on the semantic classes in the GENIA ontology
• Size
  • 2,000 abstracts
  • Number of terms: 92,723
  • Vocabulary size: 36,568

Example: The peri-kappa B site mediates human immunodeficiency virus type 2 enhancer activation in monocytes …

(Semantic class labels shown on the slide: DNA, virus, cell_type)

Page 7: Application of the NLP techniques to IE and IR CREST

GENIA part-of-speech corpus

• Each token is annotated with its part-of-speech tag.
• Size
  • 2,000 abstracts
  • 20,544 sentences
  • 501,054 words (about half the size of Penn Treebank)

Example: The/DT peri-kappa/NN B/NN site/NN mediates/VBZ human/JJ immunodeficiency/NN virus/NN type/NN 2/CD enhancer/NN activation/NN in/IN monocytes/NNS …

Page 8: Application of the NLP techniques to IE and IR CREST

GENIA treebank

• Based on the Penn Treebank standard
• Size
  • 200 abstracts (1,500 abstracts by the end of this fiscal year)

Example: CD3-epsilon expression is controlled by a downstream T lymphocyte-specific enhancer element

(Parse tree figure; node labels shown on the slide: NP, ADJP, NP, PP, VP, VP, S)

Page 9: Application of the NLP techniques to IE and IR CREST

GENIA corpus

• Used in more than 240 institutions
  • Japan (28), Asia (54), North America (63), Europe (62), etc.
• De facto standard for evaluating biomedical named-entity recognition systems
• BioNLP workshop at Coling 2004
  • Named-entity recognition shared task
  • Institute for Infocomm Research (Singapore), Stanford University (USA), University of Edinburgh (UK), University of Wisconsin-Madison (USA), Pohang University of Science and Technology (Korea), University of Alberta (Canada), University of Duisburg-Essen (Germany), Korea University (Korea), National Taiwan University (Taiwan), …

Page 10: Application of the NLP techniques to IE and IR CREST

Outline

• Background
• Building NLP resources
  • GENIA
• Extracting Disease-Gene Associations from MEDLINE
  • H-invitational
  • Extracting DGAs by machine learning
• An IR system for predicate-argument relations
  • MEDUSA

Page 11: Application of the NLP techniques to IE and IR CREST

H-Invitational Disease Edition

(Workflow diagram; labels recovered from the slide, in extraction order:)

• Text-mining; Scoring system (PANDA)
• Known disease gene; Genomic region of interest (GROI)
• List of genes
• Genes with high score
• SNPs: 1) Public, 2) Private
• Gene expression: 1) Public, 2) Private
• AND/OR
• Final Result
• H-InvDB; Other DB
• Literature (PubMed)
• Dictionary
• Specific disease; Select specific disease
• Synthetic analysis

June 25, 2004, Disease group, JBIRC

Page 12: Application of the NLP techniques to IE and IR CREST

Disease-Gene Associations extracted from MEDLINE

DGA explorer

(demo)

Page 13: Application of the NLP techniques to IE and IR CREST

• Text
  • 1.5 million MEDLINE abstracts
  • Selected by MeSH terms: “Disease Category” AND (“Amino Acids, Peptides, and Proteins” OR “Genetic Structures”)
• Parsing
  • All the sentences were parsed by the HPSG parser
  • Using a PC cluster (100 processors with GXP)
  • Time: 10 days
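
The MeSH selection above is a Boolean filter, and a minimal sketch of it may make the logic concrete. The record layout (`id`, `mesh`) and the toy records are assumptions made for illustration. They are not the format of MEDLINE or of the system described in these slides.

```python
from typing import Iterable

# Category names are taken from the slide; everything else is illustrative.
DISEASE = "Disease Category"
MOLECULAR = {"Amino Acids, Peptides, and Proteins", "Genetic Structures"}


def is_selected(mesh_categories: Iterable[str]) -> bool:
    """Return True for: Disease Category AND (Proteins OR Genetic Structures)."""
    cats = set(mesh_categories)
    return DISEASE in cats and bool(cats & MOLECULAR)


# Hypothetical toy records, not real MEDLINE entries.
abstracts = [
    {"id": "A1", "mesh": ["Disease Category", "Genetic Structures"]},
    {"id": "A2", "mesh": ["Disease Category"]},
    {"id": "A3", "mesh": ["Amino Acids, Peptides, and Proteins"]},
]

selected = [a["id"] for a in abstracts if is_selected(a["mesh"])]
print(selected)
```

Only the first toy record satisfies both sides of the AND, so it is the only one kept.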

Page 14: Application of the NLP techniques to IE and IR CREST

Disease-Gene Associations in texts

These results suggested that targeted disruption of Cyp19 caused anovulation and precocious depletion of ovarian follicles.

Furthermore, AML cells with methylated p15(INK4B) tended to express higher levels of DNMT1 and 3B.

Page 15: Application of the NLP techniques to IE and IR CREST

Training data

All foals with OLWS were homozygous for the Ile118Lys EDNRB mutation, and adults that were homozygous were not found.

Dominant radial drusen and Arg345Trp EFEMP1 mutation.

The 5 year overall survival (OS) and event-free survival (EFS) were 94 and 90 +/- 8%, respectively, with a median follow-up of 48 months.

These data may indicate that formation of parathyroid adenoma in young patients is related to a mechanism involving EGFR.

All co-occurrences are classified into “relevant” or “irrelevant” by a domain expert.
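
As a rough illustration of how co-occurrence candidates might be generated before an expert labels them, the sketch below pairs every gene name with every disease name found in the same sentence. The two tiny dictionaries and the naive substring matching are assumptions for this example only. The slides do not describe the actual matching procedure.

```python
from itertools import product

# Toy dictionaries (assumed); a real system would use much larger resources.
GENES = {"EGFR", "EDNRB", "EFEMP1"}
DISEASES = {"parathyroid adenoma", "OLWS", "radial drusen"}


def cooccurrences(sentence: str) -> list:
    """Return all (gene, disease) pairs whose names both occur in the sentence."""
    genes = sorted(g for g in GENES if g in sentence)
    diseases = sorted(d for d in DISEASES if d in sentence)
    return list(product(genes, diseases))


with_pair = ("These data may indicate that formation of parathyroid adenoma in "
             "young patients is related to a mechanism involving EGFR.")
without_pair = ("The 5 year overall survival (OS) and event-free survival (EFS) "
                "were 94 and 90 +/- 8%, respectively, with a median follow-up "
                "of 48 months.")

print(cooccurrences(with_pair))
print(cooccurrences(without_pair))
```

Each returned pair is only a candidate. Whether it is "relevant" or "irrelevant" is the judgement supplied by the domain expert.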

Page 16: Application of the NLP techniques to IE and IR CREST

Maximum entropy learning

Log-linear model

q(x) = \frac{1}{Z} \exp\left( \sum_{i=1}^{F} \lambda_i f_i(x) \right)

f_i: feature function
\lambda_i: weight

Features
• Bag-of-words
• Local context
• Gene/disease name
• Predicate-argument structures
• …
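
A minimal sketch of how a log-linear model turns weighted features into probabilities is given below. It assumes the conditional form commonly used for classification, in which each label receives a score equal to the sum of the weights of its active features and the scores are exponentiated and normalized by Z. The feature template and the weight values are invented for illustration. They are not the trained model from this work.

```python
import math

LABELS = ("relevant", "irrelevant")


def features(tokens, label):
    """Bag-of-words features, each conjoined with the candidate label."""
    return {f"{label}:word={t.lower()}" for t in tokens}


def probabilities(tokens, weights):
    """Return q(label | tokens) = exp(sum_i weight_i * f_i) / Z."""
    scores = {
        y: sum(weights.get(f, 0.0) for f in features(tokens, y))
        for y in LABELS
    }
    z = sum(math.exp(s) for s in scores.values())  # normalizer Z
    return {y: math.exp(s) / z for y, s in scores.items()}


# Hypothetical weights; real weights would be estimated from training data.
weights = {"relevant:word=mutation": 1.2, "irrelevant:word=survival": 0.8}

p = probabilities(["EFEMP1", "mutation"], weights)
print(p)
```

With these toy weights only one feature fires, so the "relevant" label gets the larger probability and the two probabilities sum to one.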

Page 17: Application of the NLP techniques to IE and IR CREST

Features of predicate-argument structures (1)

Dedifferentiation of adenoid cystic carcinoma: report of a case implicating p53 gene mutation.

(Diagram labels shown on the slide: X, ARG2, gene/disease)

Page 18: Application of the NLP techniques to IE and IR CREST

Features of predicate-argument structures (2)

These results suggested that targeted disruption of Cyp19 caused anovulation and precocious depletion of ovarian follicles.

Furthermore, AML cells with methylated p15(INK4B) tended to express higher levels of DNMT1 and 3B.

(Diagram labels shown on the slide: X, ARG1, ARG2, gene/disease, disease/gene)

Page 19: Application of the NLP techniques to IE and IR CREST

Extraction accuracy

• Training/test data: 2,253 sentences
• 10-fold cross validation

features                          recall   precision   f-score
N/A                               1.0      0.351       0.520
+ bag of words                    0.733    0.682       0.706
+ local context                   0.733    0.695       0.714
+ predicate-argument structures   0.759    0.710       0.733
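
The f-score column can be checked against the standard definition, the harmonic mean of recall and precision. The sketch below recomputes it from the rounded recall and precision values reported on the slide. Small differences in the third decimal place are expected, because the reported f-scores were presumably computed from unrounded values.

```python
def f_score(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision (0.0 when both are zero)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# (recall, precision, reported f-score) copied from the table on the slide.
rows = {
    "N/A": (1.0, 0.351, 0.520),
    "+ bag of words": (0.733, 0.682, 0.706),
    "+ local context": (0.733, 0.695, 0.714),
    "+ predicate-argument structures": (0.759, 0.710, 0.733),
}

for name, (r, p, reported) in rows.items():
    print(f"{name}: recomputed {f_score(r, p):.4f}, reported {reported:.3f}")
```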

Page 20: Application of the NLP techniques to IE and IR CREST

Outline

• Background
• Building NLP resources
  • GENIA
• Extracting Disease-Gene Associations from MEDLINE
  • H-invitational
  • Extracting DGAs by machine learning
• An IR system for predicate-argument relations
  • MEDUSA

Page 21: Application of the NLP techniques to IE and IR CREST

MEDUSA: An IR system for predicate-argument structures

Ex. Search for sentences in which the subject of the verb "activate" is a protein.

• Simple: Since the PHO2 Asp-230 mutant mimics Ser-230-phosphorylated PHO2, we postulate that only phosphorylated PHO2 protein could activate the transcription of PHO5 gene.

• With a relative pronoun: Transcription initiation by the sigma(54)-RNA polymerase holoenzyme requires an enhancer-binding protein that is thought to contact sigma(54) to activate transcription.

• Coordination: Full-length Staufen protein lacking this insertion is able to associate with oskar mRNA and activate its translation, but fails to localize the RNA to the posterior.
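
To illustrate why searching over predicate-argument structures retrieves all three sentences with one query, here is a minimal sketch over a hand-written toy index. The `Relation` record, the `search` function, and the index entries are hypothetical. They stand in for what a deep parser would be expected to produce and are not MEDUSA's actual data model or interface.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Relation:
    predicate: str    # base form of the verb
    arg1: str         # head word of the subject (ARG1)
    arg2: str         # head word of the object (ARG2)
    sentence_id: int  # 1 = simple, 2 = relative pronoun, 3 = coordination


# Hand-written relations for the three example sentences (assumed, not parser output).
INDEX = [
    Relation("activate", "protein", "transcription", 1),
    Relation("require", "initiation", "protein", 2),
    Relation("activate", "protein", "transcription", 2),
    Relation("activate", "protein", "translation", 3),
]


def search(index: List[Relation], predicate: str,
           arg1: Optional[str] = None, arg2: Optional[str] = None) -> List[Relation]:
    """Return relations matching the predicate and any arguments that are given."""
    return [
        r for r in index
        if r.predicate == predicate
        and (arg1 is None or r.arg1 == arg1)
        and (arg2 is None or r.arg2 == arg2)
    ]


hits = search(INDEX, "activate", arg1="protein")
print([r.sentence_id for r in hits])
```

Because the query is expressed over normalized relations instead of surface word order, the relative-clause and coordination cases match in the same way as the simple one.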

Page 22: Application of the NLP techniques to IE and IR CREST

MEDUSA demonstration

• 100,000 MEDLINE abstracts
• Parsed by Enju
• Genes and diseases are annotated by using the UMLS dictionary

Page 23: Application of the NLP techniques to IE and IR CREST

Summary

• GENIA corpus
  • Parts of speech, Named-entities, Parse trees
• Extracting gene-disease associations from MEDLINE
  • Machine learning with HPSG parse results
• An IR system for predicate-argument structures
  • MEDUSA

Page 24: Application of the NLP techniques to IE and IR CREST

Software and resources

• GENIA
  • Named entity corpus
  • Part-of-speech corpus
  • Parse tree corpus
  • Co-reference (Singapore)
  • Part-of-speech tagger
  • Named entity tagger (soon)
  • HPSG parse results (100,000 MEDLINE abstracts)
• Enju (HPSG parser)
• MEDUSA
• LiLFeS
• Amis