Citances: Citation Sentences for Semantic Analysis of Bioscience Text Preslav I. Nakov, Ariel S. Schwartz, and Marti A. Hearst Computer Science Division and SIMS University of California, Berkeley http://biotext.berkeley.edu Supported by NSF DBI-0317510 and a gift from Genentech


Page 1: Citances: Citation Sentences for Semantic Analysis of Bioscience Text

Citances: Citation Sentences for Semantic Analysis of Bioscience Text

Preslav I. Nakov, Ariel S. Schwartz, and Marti A. Hearst

Computer Science Division and SIMS, University of California, Berkeley

http://biotext.berkeley.edu

Supported by NSF DBI-0317510 and a gift from Genentech

Page 2: Citances: Citation Sentences for Semantic Analysis of Bioscience Text

Overview

We propose the use of the text of the sentences surrounding citations as an important tool for semantic interpretation of bioscience text.

We hypothesize several different uses of citation sentences (which we call citances), including the creation of training and testing data for semantic analysis (especially for entity and relation recognition), synonym set creation, database curation, document summarization, and information retrieval generally.

We illustrate some of these ideas, showing that the citations to one document in particular align well with what a human curator extracted by hand.

We also show preliminary results on the problem of normalizing the different ways that the same concepts are expressed within a set of citances, using and improving on existing techniques in automatic paraphrase generation.

Page 3: Citances: Citation Sentences for Semantic Analysis of Bioscience Text

Motivation for using Citances in Bioscience Text

We are interested in utilizing the large volume of available bioscience text when designing information extraction and retrieval tools.

While the amount of available text is growing rapidly, only a few small annotated corpora exist for the bioscience domain.

Full text (as opposed to abstracts) is becoming more available, providing new opportunities for automatic text processing.

Citances provide an opportunity for coping with the scarcity of annotated data. They essentially provide a semi-annotated corpus for free.

Page 4: Citances: Citation Sentences for Semantic Analysis of Bioscience Text

The Nature of Citances in Bioscience Literature

Citations are particularly abundant in biosciences.

Nearly every statement is backed up with at least one citation.

It is quite common for papers in the bioscience domain to be cited by 30-100 other papers.

The citances tend to state known biological facts with reference to the original papers that discovered them.

The cited facts are typically stated in a more concise way in the citing papers than in the original papers.

As the same facts are repeatedly stated in different ways in different papers, statistical models can be trained on existing citances to identify similar facts in unseen text.

Page 5: Citances: Citation Sentences for Semantic Analysis of Bioscience Text

Examples of Citances

“The genetic data presented here clearly show that the Eiger-induced small eye phenotype depends strongly on the JNK signaling pathway. In mammals, it has been demonstrated that the JNK pathway is essential for the execution of stress-induced cell death. JNK3, a JNK isoform that is selectively expressed in the nervous system, is required for neuronal cell death caused by excitotoxic stress (Yang et al., 1997). Embryonic fibroblasts from mouse deficient for both JNK1 and JNK2 are resistant to UV-stimulated apoptosis (Tournier et al., 2000). Whitfield et al. (2001) have shown that Bim acts downstream of the JNK pathway in NGF-deprivation-induced neuronal cell death. One possible downstream mechanism of the JNK pathway to induce cell death may be transcriptional upregulation of Bim. However, our results suggest the possibility that Eiger-induced cell death signaling may be independent of downstream jun expression, similar to the observation that the effect of UV to cause cell death does not require new gene expression (Tournier et al., 2000). The JNK signaling also mediates heat shock-induced cell death, the execution of which is caspase independent (Gabai et al., 2000). Furthermore, overexpression of the EDA receptor or TAJ/TROY, a member of the TNF receptor superfamily that exhibits extensive homology to the EDA receptor, results in the activation of the JNK pathway and caspase-independent cell death (Eby et al., 2000; Kumar et al., 2001). In some cases, JNK-induced cell death is mediated by the release of mitochondrial apoptogenic factors (Tournier et al., 2000). Recently, it has been shown that cancer cell death induced by TRAIL, a mammalian TNF superfamily ligand, requires mitochondrial release of Smac (Deng et al., 2002). One possible mechanism of Eiger-induced cell death may be JNK-mediated release of mitochondrial caspase-independent cell death factors. 
In fact, the Drosophila genome also encodes homologs of such molecules: AIF, endo G and HtrA2.” (Igaki et al., EMBO J. 2002 June; 21 (12): 3009–3018)

Page 6: Citances: Citation Sentences for Semantic Analysis of Bioscience Text

Illustrating Diagram

[Diagram: labels Fact 1, Fact 2, ..., Fact n, shown alongside citance endings with citation markers (…[12], …[27], …[42], …[17], …[17], …[23], …[9], …[16], …[7]).]

Page 7: Citances: Citation Sentences for Semantic Analysis of Bioscience Text

A Source for Unannotated Comparable Corpora

Comparable corpora are a useful resource for the development of NLP tools for question answering and summarization.

Most domains outside of news do not contain many articles discussing the same events, but bioscience citances have some of the requisite characteristics in that they include redundancies that allow identification of comparable sentences.

We later demonstrate the use of citances as comparable corpora for automatic paraphrase extraction.

Page 8: Citances: Citation Sentences for Semantic Analysis of Bioscience Text

Summarization of the Target Papers

The set of citances that refer to a specific paper can be viewed as an indication of the important facts in the paper as seen by the scientific community in that field.

This is an excellent resource for summarization. In fact, we believe that a paper that is cited enough times can be summarized using only the citances pointing to it.

Instead of showing the user all the citances pointing to a paper (as is done in CiteSeer and in Nanba et al. (2000)), we propose to first cluster related citances, and then display to the user only a summary of each cluster.

The facts expressed by each cluster can be extracted and stored in a database.

This could facilitate answering advanced queries on facts, such as “retrieve all documents that describe which genes upregulate gene G”.
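The clustering step proposed above could be sketched as follows. This is only a minimal illustration, not the authors' method: it assumes a greedy single-pass grouping by Jaccard word overlap, and the function names (`tokenize`, `cluster_citances`), the threshold value, and the sample sentences are all hypothetical.

```python
import re


def tokenize(text):
    """Lowercase a citance and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard overlap of two token sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_citances(citances, threshold=0.3):
    """Greedy clustering: put each citance in the first cluster whose
    first member overlaps enough, otherwise start a new cluster.
    The threshold is an assumed value, not one taken from the slides."""
    clusters = []
    for text in citances:
        current = tokenize(text)
        for cluster in clusters:
            if jaccard(current, tokenize(cluster[0])) >= threshold:
                cluster.append(text)
                break
        else:
            clusters.append([text])
    return clusters


# Illustrative citances, loosely modeled on the example slide.
sample = [
    "JNK3 is required for neuronal cell death",
    "Neuronal cell death requires JNK3",
    "Bim acts downstream of the JNK pathway",
]
for group in cluster_citances(sample):
    print(group)
```

Each resulting group could then be summarized once for the user, or mined for a fact to store in a database. A real system would need a stronger similarity measure than raw word overlap.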

Page 9: Citances: Citation Sentences for Semantic Analysis of Bioscience Text

Synonym Identification and Disambiguation

Bioscience literature is rife with abbreviations and synonyms.

Citances referring to the same article may allow synonyms to be identified and recorded.

A collection of related citances can help disambiguate terms with multiple meanings, since in some of the citances an unambiguous form of the term might be present.
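One way to record abbreviation and long-form pairs from a set of citances is sketched below. It is a deliberately naive heuristic and not the method used by the authors: it assumes the long form directly precedes a parenthesized short form and that each letter of the short form is the initial of one word. The function name `find_abbreviations` and the sample sentence are hypothetical.

```python
import re


def find_abbreviations(sentence):
    """Return {short_form: long_form} for patterns like
    'tumor necrosis factor (TNF)', using an initial-letter match."""
    pairs = {}
    # Only consider parentheses that contain a single short alphabetic token,
    # so citations such as "(Yang et al., 1997)" are skipped.
    for match in re.finditer(r"\(([A-Za-z]{2,10})\)", sentence):
        short = match.group(1)
        words = re.findall(r"[A-Za-z0-9]+", sentence[: match.start()])
        candidate = words[-len(short):]
        if len(candidate) == len(short) and all(
            word[0].lower() == letter.lower()
            for word, letter in zip(candidate, short)
        ):
            pairs[short] = " ".join(candidate)
    return pairs


# Made-up sentence for illustration only.
text = (
    "Deprivation of nerve growth factor (NGF) and signaling by "
    "tumor necrosis factor (TNF) were compared (Yang et al., 1997)."
)
print(find_abbreviations(text))
```

Running this over all citances that point to the same paper, and merging the resulting dictionaries, gives a small synonym table; citances that spell out the long form can then disambiguate those that use only the short form.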

Page 10: Citances: Citation Sentences for Semantic Analysis of Bioscience Text

Entity Recognition and Relation Extraction

Citances provide us with a way to build a model of many of the different ways to express a relationship type R between entities of type A and B.

We can seed learning algorithms with several examples using concepts that are semantically similar to A and similar to B, for which relation R is known to hold.

Then we can train a model to recognize this kind of relation for situations for which the relation is not known.

Since the results may extend to sentences that are not citances as well, citances-based corpora should provide a good collection for building NLP tools for recognizing entities and relations in unseen text.
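The seeding idea can be illustrated with a toy pattern learner. This is a sketch under strong assumptions, not the model the slides have in mind: it takes the literal words between a seed pair as a "pattern" and reuses them verbatim on unseen text, where a real system would train a statistical model. The function names, the seed pair, and the sentence about GeneX and GeneY are hypothetical.

```python
import re


def learn_patterns(citances, seed_pairs):
    """Collect the text found between each seed pair (A, B) in the citances."""
    patterns = set()
    for sentence in citances:
        lowered = sentence.lower()
        for a, b in seed_pairs:
            regex = re.escape(a.lower()) + r"\s+(.+?)\s+" + re.escape(b.lower())
            match = re.search(regex, lowered)
            if match:
                patterns.add(match.group(1))
    return patterns


def extract_relations(sentence, entities, patterns):
    """Return ordered entity pairs joined by any learned pattern."""
    lowered = sentence.lower()
    found = []
    for a in entities:
        for b in entities:
            if a == b:
                continue
            if any(f"{a.lower()} {p} {b.lower()}" in lowered for p in patterns):
                found.append((a, b))
    return found


# Seed: a relation instance assumed to be known to hold.
citances = [
    "Whitfield et al. (2001) have shown that Bim acts downstream of the JNK pathway."
]
patterns = learn_patterns(citances, [("Bim", "JNK pathway")])

# Invented sentence with placeholder entity names, for illustration only.
unseen = "GeneX acts downstream of the GeneY pathway in our assay."
print(patterns)
print(extract_relations(unseen, ["GeneX", "GeneY pathway"], patterns))
```

Because many citances restate the same fact in different words, the same seed pair would typically yield several distinct patterns, which is what makes citances attractive as training data.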

Page 11: Citances: Citation Sentences for Semantic Analysis of Bioscience Text

Targets for Curation

We hypothesize that citances contain the most important information expressed in the cited document, and therefore contain the information that curators would want to make use of.

We have found support for this hypothesis with two sample papers being used by a cancer researcher who is recording information about the process of apoptosis.

Page 12: Citances: Citation Sentences for Semantic Analysis of Bioscience Text

Improved Citation Indexes for Information Retrieval

Citation indexes can be improved by combining methods that use citances’ context (e.g., Mercer and Di Marco (2004)) with methods that use citances’ content (e.g., Bradshaw (2003)).

For example, indexing terms can be taken from citances referring to a target paper, weighting them by both their relative frequency and the type of citation they appear in.
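A minimal sketch of such a weighting scheme follows. The slides give no formula, so everything here is an assumption: each term occurrence contributes the weight of its citation type, and the total is divided by the number of tokens across all citances. The function name `weighted_index_terms`, the citation type labels, and the weight values are hypothetical.

```python
import re
from collections import Counter


def weighted_index_terms(citances, type_weights, default_weight=1.0):
    """citances: list of (text, citation_type) pairs for one target paper.
    Returns {term: type-weighted relative frequency}."""
    weighted = Counter()
    total_tokens = 0
    for text, citation_type in citances:
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        total_tokens += len(tokens)
        weight = type_weights.get(citation_type, default_weight)
        for token in tokens:
            weighted[token] += weight
    if total_tokens == 0:
        return {}
    return {term: score / total_tokens for term, score in weighted.items()}


# Assumed citation types and weights, for illustration only.
weights = {"supports": 1.0, "background": 0.5}
citances = [
    ("JNK mediates cell death", "supports"),
    ("cell death was studied", "background"),
]
scores = weighted_index_terms(citances, weights)
for term, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{term}\t{score:.4f}")
```

Terms that recur across many citances of a highly weighted type rise to the top and would be the first candidates for index terms of the target paper. Stop-word removal is omitted to keep the sketch short.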

Page 13: Citances: Citation Sentences for Semantic Analysis of Bioscience Text

Related Work

Traditional citation analysis dates back to the 1960s (Garfield). It includes citation categorization, context analysis, and citer motivation.

Citation indexing systems, such as ISI’s SCI, and CiteSeer.

Mercer and Di Marco (2004) propose to improve citation indexing using citation types.

Bradshaw (2003) introduces Reference Directed Indexing (RDI), which indexes documents using the terms in the citances citing them.

Page 14: Citances: Citation Sentences for Semantic Analysis of Bioscience Text

Related Work (cont.)

Teufel and Moens (2002) identify citances to improve summarization of the citing paper. They give lower weight to citances as candidate sentences for summarization.

Nanba et al. (2000) use citances as features for classifying papers into topics.

A field related to citation indexing is the use of link structure and anchor text of Web pages. Applications include IR, classification, Web crawlers, and summarization.

See the full paper for references.

Page 15: Citances: Citation Sentences for Semantic Analysis of Bioscience Text

Issues for Processing Citances

Text span: identification of the appropriate phrase, clause, or sentence that constitutes a citance, and correct mapping of citations when shown as lists or groups (e.g., “[22-25]”).

Grouping citances by topic: citances that cite the same document should be grouped by the facts they state.

Normalizing or paraphrasing citances: for IR, summarization, learning synonyms, relation extraction, question answering, and machine translation.
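The mapping of grouped citations such as “[22-25]” to individual references can be sketched as below. This assumes numeric, bracketed markers with commas and hyphen or en dash ranges; author-year styles are not handled. The function name `expand_citation_group` is hypothetical.

```python
import re


def expand_citation_group(marker):
    """Expand a marker like '[3, 7-9, 12]' into [3, 7, 8, 9, 12]."""
    inner = marker.strip().strip("[]")
    numbers = []
    for part in inner.split(","):
        part = part.strip()
        # Accept a hyphen or an en dash as the range separator.
        match = re.fullmatch(r"(\d+)\s*[-\u2013]\s*(\d+)", part)
        if match:
            start, end = int(match.group(1)), int(match.group(2))
            numbers.extend(range(start, end + 1))
        elif part.isdigit():
            numbers.append(int(part))
    return numbers


print(expand_citation_group("[22-25]"))
print(expand_citation_group("[3, 7-9, 12]"))
```

With the group expanded, the same citance can be attached to every paper it cites, so that a sentence ending in “[22-25]” counts as a citance for references 22, 23, 24, and 25.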

Page 16: Citances: Citation Sentences for Semantic Analysis of Bioscience Text

Paraphrasing Citances

Page 17: Citances: Citation Sentences for Semantic Analysis of Bioscience Text

Conclusions

We have motivated and discussed the potentially enormous role that the use of sentences surrounding citations, or citances, can have for automated analysis of bioscience literature.

In work not yet reported, we have found that citances align very well with rich information being curated by hand by a molecular biologist, and suspect they will be equally useful for other curation tasks.

We also hypothesize that citances will be a gold mine of data for training algorithms to perform semantic analysis of bioscience text, and will improve the results of querying the bioscience literature.

Much work must be done before citances can be put to full use.

We have demonstrated some initial results in paraphrasing citances that discuss the same topic, but more work remains to be done to improve results, and to group similar citances together.

In future work, we plan to thoroughly explore the possibilities surrounding the analysis and use of citances for bioscience text analysis.