Scaling Up BioNLP: Application of a Text Annotation Architecture to Noun Compound Bracketing
Preslav Nakov, Ariel Schwartz, Brian Wolf, Marti Hearst
Computer Science Division and SIMS, University of California, Berkeley
http://biotext.berkeley.edu
Supported by NSF DBI-0317510 and a gift from Genentech
Plan
Overview
Noun compound (NC) bracketing
Problems with Web Counts
Layers of annotation
Applying LQL to NC bracketing
Evaluation
Overview
Motivation: need to re-use the results of NLP processing, both for additional processing and for end applications (data mining, etc.)
Proposed solution: Layers of annotations over text
Illustration: Application to noun compound bracketing
Noun Compound Bracketing
(a) [ [ liver cell ] antibody ] (left bracketing)
(b) [ liver [cell line] ] (right bracketing)
In (a), the antibody targets the liver cell. In (b), the cell line is derived from the liver.
Related Work
Pustejovsky et al. (1993): adjacency model, Pr(w1|w2) vs. Pr(w2|w3)
Lauer (1995): dependency model, Pr(w1|w2) vs. Pr(w1|w3)
Keller & Lapata (2004): use the Web: unigrams and bigrams
Nakov & Hearst (2005), to be presented at CoNLL: use the Web: chi-squared, n-grams, paraphrases, surface features
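The two classic models can be sketched in a few lines. This is a minimal illustration, assuming unigram and bigram counts are already available in plain dictionaries. The dependency model is taken here to compare the w1-w2 association with the w1-w3 association, following Lauer (1995). The function names, the tie-breaking rule (default to left), and the toy counts are all hypothetical and not taken from the cited papers.

```python
from typing import Dict, Tuple

def cond_prob(w_a: str, w_b: str,
              bigrams: Dict[Tuple[str, str], int],
              unigrams: Dict[str, int]) -> float:
    """Estimate Pr(w_a | w_b) as #(w_a w_b) / #(w_b); 0.0 if w_b is unseen."""
    denom = unigrams.get(w_b, 0)
    if denom == 0:
        return 0.0
    return bigrams.get((w_a, w_b), 0) / denom

def bracket(w1: str, w2: str, w3: str,
            bigrams: Dict[Tuple[str, str], int],
            unigrams: Dict[str, int],
            model: str = "dependency") -> str:
    """Return 'left' for [[w1 w2] w3] or 'right' for [w1 [w2 w3]]."""
    left_score = cond_prob(w1, w2, bigrams, unigrams)
    if model == "adjacency":
        right_score = cond_prob(w2, w3, bigrams, unigrams)
    elif model == "dependency":
        right_score = cond_prob(w1, w3, bigrams, unigrams)
    else:
        raise ValueError(f"unknown model: {model}")
    # Assumption: ties go to 'left', the majority class.
    return "left" if left_score >= right_score else "right"

# Invented toy counts, for illustration only.
toy_unigrams = {"liver": 1000, "cell": 2000, "line": 3000}
toy_bigrams = {("liver", "cell"): 50, ("cell", "line"): 400, ("liver", "line"): 200}

adjacency_result = bracket("liver", "cell", "line", toy_bigrams, toy_unigrams, "adjacency")
dependency_result = bracket("liver", "cell", "line", toy_bigrams, toy_unigrams, "dependency")
```

With these invented counts both models choose the right bracketing for "liver cell line"; with real counts the two models can disagree.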
Nakov & Hearst (2005)
Web page hits: proxy for n-gram frequencies
Sample surface features
  amino-acid sequence: left
  brain stem’s cell: left
  brain’s stem cell: right
Majority vote to combine different models
Accuracy 89.34%
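Combining models by majority vote might look like the sketch below. It assumes each model outputs "left", "right", or None (abstain), and that ties fall back to a default class; both conventions are assumptions made for illustration, not a description of the exact voting scheme in Nakov & Hearst (2005).

```python
from collections import Counter
from typing import Iterable, Optional

def majority_vote(predictions: Iterable[Optional[str]], default: str = "left") -> str:
    """Combine per-model bracketing decisions; abstentions (None) are ignored."""
    votes = Counter(p for p in predictions if p in ("left", "right"))
    if votes["left"] == votes["right"]:
        return default  # tie or no votes: fall back to the default class
    return "left" if votes["left"] > votes["right"] else "right"

combined = majority_vote(["left", "right", "right", None])
```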
Web Counts: Problems
The Web lacks linguistic annotation
Pr(health|care) = #(“health care”) / #(care)
  “health”: returns nouns
  “care”: returns both verbs and nouns
  the two words can be adjacent by chance
  the two words can come from different sentences
Cannot find:
  stem cells VERB PREPOSITION brain
  protein synthesis’ inhibition
Page hits are inaccurate
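The effect of missing annotation can be shown on a toy corpus. The sketch below contrasts a plain string match, which ignores part-of-speech tags and sentence boundaries, with a count restricted to noun-noun pairs inside one sentence. The corpus, tag names, and function names are invented for illustration.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]  # (token, POS tag) pairs

def raw_bigram_count(sentences: List[Sentence], w1: str, w2: str) -> int:
    """Count 'w1 w2' as a plain string match, ignoring tags and sentence breaks."""
    tokens = [tok for sent in sentences for tok, _ in sent]
    return sum(1 for a, b in zip(tokens, tokens[1:]) if (a, b) == (w1, w2))

def noun_bigram_count(sentences: List[Sentence], w1: str, w2: str) -> int:
    """Count 'w1 w2' only when both tokens are nouns in the same sentence."""
    count = 0
    for sent in sentences:
        for (a, tag_a), (b, tag_b) in zip(sent, sent[1:]):
            if (a, b) == (w1, w2) and tag_a == tag_b == "noun":
                count += 1
    return count

# Invented toy corpus: the second "health care" match spans two sentences
# and its "care" is a verb.
toy_corpus: List[Sentence] = [
    [("health", "noun"), ("care", "noun"), ("costs", "noun"), ("rose", "verb")],
    [("they", "pron"), ("value", "verb"), ("health", "noun")],
    [("care", "verb"), ("for", "prep"), ("patients", "noun")],
    [("we", "pron"), ("care", "verb"), ("about", "prep"), ("health", "noun")],
]

raw = raw_bigram_count(toy_corpus, "health", "care")
annotated = noun_bigram_count(toy_corpus, "health", "care")
```

On this corpus the raw match counts two occurrences, while only one is a genuine noun-noun pair within a sentence.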
Solution: MEDLINE+LQL
MEDLINE: ~13 million abstracts
We annotated:
  1.4 million abstracts
  ~10 million sentences
  ~320 million annotations
Layered Query Language: demo at ACL! http://biotext.berkeley.edu/lql/
The System
Built on top of an RDBMS
Supports layers of annotations over text
  hierarchical, overlapping
  cannot be represented by a single-file XML
Specialized query language LQL (Layered Query Language)
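One way to picture layered annotations in a relational database is as stand-off spans: each annotation records its layer, tag, and character offsets into the text, so layers can overlap freely. The sketch below uses Python's sqlite3 module with a hypothetical single-table schema; the table name, column names, and example text are assumptions and do not describe the actual schema of the system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per annotation, addressed by character offsets.
conn.execute("""
    CREATE TABLE annotation (
        doc_id    INTEGER,
        layer     TEXT,
        tag_type  TEXT,
        start_pos INTEGER,
        end_pos   INTEGER,
        content   TEXT
    )
""")

text = "the liver cell line"
conn.executemany(
    "INSERT INTO annotation VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "sentence", "S", 0, 19, text),
        (1, "shallow_parse", "NP", 0, 19, text),
        (1, "pos", "DT", 0, 3, "the"),
        (1, "pos", "noun", 4, 9, "liver"),
        (1, "pos", "noun", 10, 14, "cell"),
        (1, "pos", "noun", 15, 19, "line"),
    ],
)

# Nouns whose span lies inside a noun-phrase span (a containment join).
rows = conn.execute("""
    SELECT tok.content
    FROM annotation AS np
    JOIN annotation AS tok
      ON tok.doc_id = np.doc_id
     AND tok.start_pos >= np.start_pos
     AND tok.end_pos <= np.end_pos
    WHERE np.layer = 'shallow_parse' AND np.tag_type = 'NP'
      AND tok.layer = 'pos' AND tok.tag_type = 'noun'
    ORDER BY tok.start_pos
""").fetchall()
nouns_in_np = [r[0] for r in rows]
```

Writing such joins by hand quickly becomes unwieldy for longer patterns, which is the kind of query a specialized language is meant to simplify.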
Noun Compound Extraction (1)
FROM
  [layer='shallow_parse' && tag_type='NP'
    ^ [layer='pos' && tag_type="noun"]
      [layer='pos' && tag_type="noun"]
      [layer='pos' && tag_type="noun"] $
  ] AS compound
SELECT compound.content
^ : layers’ beginnings should match
$ : layers’ endings should match
Noun Compound Extraction (2)
SELECT LOWER(compound.content) AS lc, COUNT(*) AS freq
FROM
  BEGIN_LQL
    FROM
      [layer='shallow_parse' && tag_type='NP'
        ^ [layer='pos' && tag_type="noun"]
          [layer='pos' && tag_type="noun"]
          [layer='pos' && tag_type="noun"] $
      ] AS compound
    SELECT compound.content
  END_LQL
GROUP BY lc
ORDER BY freq DESC
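For readers unfamiliar with LQL, the same logic can be approximated in plain Python: keep noun phrases that consist of exactly three nouns, lowercase them, and rank by frequency. The input format (a list of noun phrases, each a list of token/tag pairs) and the toy data are assumptions made for this sketch.

```python
from collections import Counter
from typing import List, Tuple

NounPhrase = List[Tuple[str, str]]  # (token, POS tag) pairs

def three_noun_compounds(noun_phrases: List[NounPhrase]) -> List[Tuple[str, int]]:
    """Return (lowercased compound, frequency) pairs, most frequent first."""
    counts: Counter = Counter()
    for phrase in noun_phrases:
        # The NP must start and end with the three nouns, i.e. be exactly them.
        if len(phrase) == 3 and all(tag == "noun" for _, tag in phrase):
            counts[" ".join(tok for tok, _ in phrase).lower()] += 1
    return counts.most_common()

# Invented toy input.
toy_phrases: List[NounPhrase] = [
    [("Bone", "noun"), ("marrow", "noun"), ("cells", "noun")],
    [("bone", "noun"), ("marrow", "noun"), ("cells", "noun")],
    [("liver", "noun"), ("cell", "noun"), ("line", "noun")],
    [("the", "DT"), ("liver", "noun"), ("cell", "noun"), ("line", "noun")],
]

ranked = three_noun_compounds(toy_phrases)
```

The last toy phrase is skipped because its determiner means the three nouns do not span the whole noun phrase.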
Noun Compound Extraction (3)
SELECT LOWER(compound.content) AS lc, COUNT(*) AS freq
FROM
  BEGIN_LQL
    FROM
      [layer='shallow_parse' && tag_type='NP'
        ^ ( { ALLOW GAPS }
            ![layer='pos' && tag_type="noun"]
            ( [layer='pos' && tag_type="noun"]
              [layer='pos' && tag_type="noun"]
              [layer='pos' && tag_type="noun"] ) $
          ) $
      ] AS compound
    SELECT compound.content
  END_LQL
GROUP BY lc
ORDER BY freq DESC
! : layer negation
( ... ) : artificial range
Finding Bigram Counts
SELECT COUNT(*) AS freq
FROM
  BEGIN_LQL
    FROM
      [layer='shallow_parse' && tag_type='NP'
        [layer='pos' && tag_type="noun" &&
         content="immunodeficiency"] AS word1
        [layer='pos' && tag_type="noun" &&
         (content="virus"||content="viruses")]
      ]
    SELECT word1.content
  END_LQL
Paraphrases
Types of paraphrases (Warren, 1978):
Prepositional
  immunodeficiency virus in humans: right
Verbal
  virus causing human immunodeficiency: left
  immunodeficiency virus found in humans: right
Copula
  immunodeficiency virus that is human: right
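The link between a paraphrase and a bracketing decision can be made explicit. In the sketch below, for a compound w1 w2 w3, a paraphrase of the form "w2 w3 ... w1" keeps w2 and w3 together and so votes right, while "w3 ... w1 w2" keeps w1 and w2 together and so votes left. The function name, the whitespace tokenization, and the small inflection map are assumptions made for illustration.

```python
from typing import Dict, Optional

def paraphrase_vote(w1: str, w2: str, w3: str, paraphrase: str,
                    variants: Optional[Dict[str, str]] = None) -> Optional[str]:
    """Return 'left', 'right', or None if the paraphrase fits neither pattern."""
    variants = variants or {}
    # Lowercase each token and map inflected forms to a base form.
    tokens = [variants.get(t.lower(), t.lower()) for t in paraphrase.split()]
    if len(tokens) < 4:
        return None  # need at least one linking word between the parts
    if tokens[:2] == [w2, w3] and tokens[-1] == w1:
        return "right"  # [w1 [w2 w3]]
    if tokens[0] == w3 and tokens[-2:] == [w1, w2]:
        return "left"   # [[w1 w2] w3]
    return None

# Hypothetical inflection map; a real system would need a lexical resource.
inflections = {"humans": "human", "viruses": "virus"}

prep_vote = paraphrase_vote("human", "immunodeficiency", "virus",
                            "immunodeficiency virus in humans", inflections)
verb_vote = paraphrase_vote("human", "immunodeficiency", "virus",
                            "virus causing human immunodeficiency", inflections)
```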
Prepositional Paraphrases
SELECT LOWER(prep.content) lp, COUNT(*) AS freq
FROM
  BEGIN_LQL
    FROM
      [layer='sentence'
        [layer='pos' && tag_type="noun" && content = "immunodeficiency"]
        [layer='pos' && tag_type="noun" && content IN ("virus","viruses")]
        [layer='pos' && tag_type='IN'] AS prep
        ?[layer='pos' && tag_type='DT' && content IN ("the","a","an")]
        [layer='pos' && tag_type="noun" && content IN ("human", "humans")]
      ]
    SELECT prep.content
  END_LQL
GROUP BY lp
ORDER BY freq DESC
? : optional layer
Evaluation
obtained 418,678 noun compounds (NCs)
annotated the top 232 NCs (after cleaning)
  agreement 88%
  kappa .606
baseline (left): 83.19%
n-grams: Pr, #, χ2
prepositional paraphrases
for inflections, we used UMLS
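For reference, the χ2 association score for a word pair can be computed from four counts using the standard 2x2 contingency-table formula. The sketch below is a generic implementation of that formula, not necessarily the exact variant used in the evaluation; the function and parameter names and the example numbers are invented.

```python
def chi_squared(count_12: int, count_1: int, count_2: int, total: int) -> float:
    """Pearson chi-squared for a 2x2 table built from bigram and unigram counts.

    count_12: occurrences of the bigram "w1 w2"
    count_1:  occurrences of w1 as the first word
    count_2:  occurrences of w2 as the second word
    total:    total number of bigrams
    """
    a = count_12                 # w1 followed by w2
    b = count_1 - count_12       # w1 followed by something else
    c = count_2 - count_12       # something else followed by w2
    d = total - a - b - c        # neither
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: no evidence either way
    return total * (a * d - b * c) ** 2 / denom

# Invented example: a=10, b=20, c=30, d=40.
score = chi_squared(count_12=10, count_1=30, count_2=40, total=100)
```

To bracket w1 w2 w3 with this score, one would compare the value for one word pair against the value for the competing pair, in the same way as the probability-based models.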
Discussion
Semantics of bone marrow cells
  top verbal paraphrases
    cells derived from bone marrow (22 instances)
    cells isolated from bone marrow (14 instances)
  top prepositional paraphrases
    cells in bone marrow (456 instances)
    cells from bone marrow (108 instances)
Finding hard examples for NC bracketing
  w1w2w3 such that both w1w2 and w2w3 are MeSH terms
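The hard-example filter can be sketched as a simple set lookup. The term set below is a toy stand-in for a controlled vocabulary; it is not drawn from the actual MeSH contents, and the function name is hypothetical.

```python
from typing import Iterable, List, Set

def hard_examples(compounds: Iterable[str], terms: Set[str]) -> List[str]:
    """Keep three-word compounds whose first and last word pairs are both known terms."""
    selected = []
    for compound in compounds:
        words = compound.lower().split()
        if len(words) != 3:
            continue
        w1, w2, w3 = words
        # Both bracketings look plausible when each word pair is a term.
        if f"{w1} {w2}" in terms and f"{w2} {w3}" in terms:
            selected.append(compound)
    return selected

# Toy term set standing in for a real vocabulary lookup.
toy_terms = {"bone marrow", "marrow cells", "liver cell"}
hard = hard_examples(["bone marrow cells", "liver cell antibody"], toy_terms)
```

Such compounds are hard because the vocabulary supports both the left and the right grouping, so term membership alone cannot decide the bracketing.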