Automatically Acquiring a Linguistically Motivated Genic Interaction Extraction System
Mark A. Greenwood, Mark Stevenson, Yikun Guo, Henk Harkema, Angus Roberts
Natural Language Processing Group, Department of Computer Science, University of Sheffield, UK


August 7th 2005 LLL05: Automatically Acquiring a Linguistically Motivated Genic Interaction Extraction System

Outline of Talk

• Background to our Approach

• Extraction Patterns

• Acquiring and Using Extraction Patterns

• Challenge Evaluation

• Analysis

• Conclusions and Future Work


Background to our Approach

• We had developed a system to perform sentence filtering. Sentence filtering involves classifying sentences based on whether or not they are relevant to a given scenario. We reported F-measure results of approximately 55% on a management succession task (Stevenson & Greenwood, 2005).

• For participation in the LLL challenge we extended this system: we moved from sentence filtering to extracting interactions, and we extended the pattern representation.

• Previously we had represented sentences using the verbs in the sentence and their direct arguments.


Extraction Patterns

• We represent extraction patterns as paths in a dependency tree. Dependency trees represent text by linking each word in a sentence with the words which directly modify it. For example, the noun phrase “the brown dog” is represented by two dependency relations: det, linking “dog” to “the”, and adj, linking “dog” to “brown”.

• In these experiments we used MINIPAR (Lin, 1999) to generate the dependency trees from which the extraction patterns were taken.

• The supplied dependency relations were not used because of time constraints in adapting our approach to the task.


Extraction Patterns

• The nodes in a dependency tree can be either lexical items (i.e. words) or semantic categories such as gene, protein, agent, target, etc.

• Lexical items are represented in lower case

• Semantic categories are capitalised


Extraction Patterns

Given the dependency tree representing the phrase “…AGENT represses the transcription of TARGET…”, we extract chain-shaped paths as extraction patterns:

verb[v/repress](subj[n/AGENT])

verb[v/repress](obj[n/transcription](of[n/TARGET]))

verb[v/repress](obj[n/transcription]+subj[n/AGENT])

verb[v/repress](obj[n/transcription](of[n/TARGET])+subj[n/AGENT])
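The bracket notation used for these patterns can be illustrated with a small sketch. The `Node` class and the `render` and `chains` helpers below are hypothetical names introduced only for illustration, and the alphabetical ordering of sibling relations is an assumption read off the examples on the slides, not a documented rule of the system.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """One node of a dependency tree: relation label, part of speech, filler."""
    rel: str
    pos: str
    filler: str
    children: List["Node"] = field(default_factory=list)


def render(node: Node) -> str:
    """Serialise a (sub)tree in the rel[pos/filler](child+child) notation."""
    text = f"{node.rel}[{node.pos}/{node.filler}]"
    if node.children:
        # Assumption: siblings are ordered alphabetically by relation label.
        kids = sorted(node.children, key=lambda child: child.rel)
        text += "(" + "+".join(render(child) for child in kids) + ")"
    return text


def chains(node: Node) -> Iterator[Node]:
    """Yield one single-branch copy of the tree per root-to-leaf path."""
    if not node.children:
        yield Node(node.rel, node.pos, node.filler)
        return
    for child in node.children:
        for sub in chains(child):
            yield Node(node.rel, node.pos, node.filler, [sub])


# "...AGENT represses the transcription of TARGET..."
tree = Node("verb", "v", "repress", [
    Node("subj", "n", "AGENT"),
    Node("obj", "n", "transcription", [Node("of", "n", "TARGET")]),
])

for chain in chains(tree):
    print(render(chain))
print(render(tree))
```

Run on the example phrase, the two root-to-leaf chains correspond to the first two patterns listed above, and rendering the whole tree gives the last one.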


Learning Extraction Patterns

Iterative Learning Algorithm

1. Begin with a set of seed patterns which are known to be good extraction patterns.

2. Compare every other pattern with the ones known to be good.

3. Choose the highest scoring of these and add them to the set of good patterns.

4. Stop if enough patterns have been learned, else repeat from step 2.
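The four steps can be sketched as a simple loop. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, a candidate's score is assumed to be its mean similarity to the patterns already accepted, and a Jaccard overlap stands in for the real similarity measure.

```python
def jaccard(a, b):
    """Stand-in similarity: overlap between two sets of element-filler pairs."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def learn_patterns(seeds, candidates, similarity=jaccard, max_patterns=10):
    """Grow the set of accepted patterns one candidate at a time."""
    accepted = list(seeds)                                   # step 1
    remaining = [c for c in candidates if c not in accepted]
    while remaining and len(accepted) < max_patterns:        # step 4
        # Step 2: compare every remaining candidate with the accepted set.
        # Assumption: the score is the mean similarity to accepted patterns.
        scores = {
            c: sum(similarity(c, a) for a in accepted) / len(accepted)
            for c in remaining
        }
        # Step 3: move the highest-scoring candidate into the accepted set.
        best = max(remaining, key=scores.get)
        accepted.append(best)
        remaining.remove(best)
    return accepted
```

Patterns are passed as `frozenset` objects of element-filler strings so that they can be used as dictionary keys.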

[Figure: diagram of the learning loop, with components labelled Seeds, Candidates, Rank and Patterns]


Pattern Similarity

• We determine the similarity between two patterns using a vector space model inspired by that commonly used in IR. Each pattern can be represented by a set of pattern element-filler pairs. The set of pattern element-filler pairs in a corpus forms the basis for a vector space, where the value is 1 if a pattern contains the pair and 0 otherwise.

• The similarity of two patterns can then be computed as:

\[ \mathrm{sim}(\vec{a}, \vec{b}) = \frac{\vec{a}\, W\, \vec{b}^{T}}{|\vec{a}|\,|\vec{b}|} \]

• This is the cosine measure augmented with a matrix W which lists the similarity between each pair of pattern element-filler pairs. The similarity between pattern element-filler pairs is computed using a WordNet similarity measure proposed by Banerjee and Pedersen (2002), referred to as Adapted Lesk.


Pattern Similarity

Extraction Patterns

a. verb[v/block](subj[n/protein])
b. verb[v/repress](subj[n/enzyme])
c. verb[v/promote](subj[n/protein])

Matrix Labels

1. subj_protein, 2. subj_enzyme, 3. verb_block, 4. verb_repress, 5. verb_promote

Similarity Matrix (rows and columns follow the matrix labels 1 to 5)

     1     2     3     4     5
1    1     0.95  0     0     0
2    0.95  1     0     0     0
3    0     0     1     0.9   0.1
4    0     0     0.9   1     0.1
5    0     0     0.1   0.1   1

Similarity Values

sim(a, b) = 0.925
sim(a, c) = 0.55
sim(b, c) = 0.525
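The three similarity values on this slide can be reproduced with a short sketch of the W-augmented cosine measure, in which each pattern's binary vector is multiplied through the matrix and the result is divided by the product of the vector lengths. The helper names `to_vector` and `similarity` are illustrative assumptions; the matrix entries are those given on the slide.

```python
import math

LABELS = ["subj_protein", "subj_enzyme", "verb_block", "verb_repress", "verb_promote"]

# Similarity between each pair of element-filler pairs, as given on the slide.
W = [
    [1.0, 0.95, 0.0, 0.0, 0.0],
    [0.95, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.9, 0.1],
    [0.0, 0.0, 0.9, 1.0, 0.1],
    [0.0, 0.0, 0.1, 0.1, 1.0],
]


def to_vector(pairs):
    """Binary vector: 1 if the pattern contains the element-filler pair, else 0."""
    return [1.0 if label in pairs else 0.0 for label in LABELS]


def similarity(a, b, w=W):
    """Cosine measure augmented with w: (a W b^T) / (|a| |b|)."""
    numerator = sum(
        a[i] * w[i][j] * b[j] for i in range(len(a)) for j in range(len(b))
    )
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return numerator / norm if norm else 0.0


a = to_vector({"subj_protein", "verb_block"})      # verb[v/block](subj[n/protein])
b = to_vector({"subj_enzyme", "verb_repress"})     # verb[v/repress](subj[n/enzyme])
c = to_vector({"subj_protein", "verb_promote"})    # verb[v/promote](subj[n/protein])

print(round(similarity(a, b), 3))  # 0.925
print(round(similarity(a, c), 3))  # 0.55
print(round(similarity(b, c), 3))  # 0.525
```

With W set to the identity matrix the function reduces to the ordinary cosine measure, under which patterns a and b would score 0 because they share no element-filler pair.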


Acquiring Patterns

• We use this approach to learn patterns containing a known agent or target from the training data.

• The texts are pre-processed to include AGENT and TARGET as semantic class labels.

• We restricted certain terms (e.g. repress) to domain-specific senses in WordNet for the similarity calculations. We started from the 30 verbs in the PASBio project (Wattarujeekrit et al., 2004); a further 28 nouns and verbs were also restricted.

• At each iteration of the algorithm we accepted up to 4 new patterns which were within 0.95 of the best pattern being accepted.

• The algorithm was allowed to run until no more patterns could be acquired.
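The acceptance rule can be sketched as follows. The slide does not say exactly how "within 0.95 of the best pattern" is measured, so this sketch assumes it means a score of at least 0.95 times the best score in that iteration; the function name and parameters are hypothetical.

```python
def select_accepted(scored, max_new=4, ratio=0.95):
    """Pick the patterns to accept in one iteration.

    scored: list of (pattern, score) tuples for the current candidates.
    Assumption: a candidate qualifies if its score is >= ratio * best score.
    """
    if not scored:
        return []
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    best_score = ranked[0][1]
    # Only the top max_new candidates are considered, then thresholded.
    return [pattern for pattern, score in ranked[:max_new]
            if score >= ratio * best_score]
```

Under this reading, learning stops once `select_accepted` has no candidates left to return.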


Seed Patterns

• We used the following manually selected seed patterns in all our experiments:

verb[v/transcribe](by[n/AGENT]+obj[n/TARGET])

verb[v/be](of[n/AGENT]+s[n/expression](of[n/TARGET]))

verb[v/inhibit](obj[n/activity](nn[n/TARGET])+subj[n/AGENT])

verb[v/bind](mod[r/specifically](to[n/TARGET])+subj[n/AGENT])

verb[v/block](obj[n/capacity](of[n/TARGET])+subj[n/AGENT])

verb[v/regulate](obj[n/expression](nn[n/TARGET])+subj[n/AGENT])

verb[v/require](obj[n/AGENT]+subj[n/gene](nn[n/TARGET]))

verb[v/repress](obj[n/transcription](of[n/TARGET])+subj[n/AGENT])


Extracting Relations

• Text from which we wish to extract relations is processed to produce extraction patterns in the same way as before.

• Any pattern which matches an acquired pattern is used to extract information. When matching, AGENT and TARGET in an acquired pattern match anything. Not all patterns contain both an AGENT and a TARGET, so post-processing links partial relations together.

• For example, the pattern

verb[v/stimulates](subj[n/AGENT]+obj[n/TARGET])

matches against

verb[v/stimulates](subj[n/GerE]+obj[n/cotD])

resulting in an interaction with GerE as the agent and cotD as the target.
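One simple way to picture this matching step is to treat AGENT and TARGET as wildcards over the string form of a pattern. The sketch below does this with regular expressions; it is an illustration only, the function name is hypothetical, and the real system may well match on tree structures instead of strings.

```python
import re


def match_interaction(acquired, candidate):
    """Match a candidate pattern against an acquired one.

    AGENT and TARGET in the acquired pattern match any filler; every other
    character must match literally. Returns the captured fillers, or None.
    """
    regex = re.escape(acquired)
    # str.replace is used so the regex fragments are inserted verbatim.
    regex = regex.replace("AGENT", r"(?P<agent>[^\]]+)")
    regex = regex.replace("TARGET", r"(?P<target>[^\]]+)")
    found = re.fullmatch(regex, candidate)
    return found.groupdict() if found else None


result = match_interaction(
    "verb[v/stimulates](subj[n/AGENT]+obj[n/TARGET])",
    "verb[v/stimulates](subj[n/GerE]+obj[n/cotD])",
)
print(result)  # {'agent': 'GerE', 'target': 'cotD'}
```

A pattern that contains only AGENT or only TARGET returns a dictionary with a single entry, which is the kind of partial relation that the post-processing step would need to link up.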


Challenge Evaluation

• We submitted three runs for evaluation:
  Baseline: a simple baseline system which pairs all dictionary elements in a sentence with each other in both orders.
  Basic: a system trained on the basic data set without coreference as provided for the LLL-05 challenge.
  Expanded: a system trained on the basic data set augmented with 78 automatically acquired weakly labelled MedLine sentences.

• The basic and expanded systems differ only in the training data used to acquire the extraction patterns.
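The baseline run is easy to picture in code. The sketch below assumes a sentence is available as a list of tokens and the dictionary as a set of names; both representations, and the function name, are assumptions made for illustration.

```python
from itertools import permutations


def baseline_interactions(tokens, dictionary):
    """Pair every dictionary element in a sentence with every other one,
    in both orders, as (agent, target) tuples."""
    found = []
    for token in tokens:
        if token in dictionary and token not in found:
            found.append(token)
    return list(permutations(found, 2))


pairs = baseline_interactions(
    ["GerE", "stimulates", "cotD", "transcription"],
    {"GerE", "cotD", "sigK"},
)
print(pairs)  # [('GerE', 'cotD'), ('cotD', 'GerE')]
```

Because every ordered pair is proposed, recall is high by construction while precision suffers.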


Challenge Evaluation

System     Precision        Recall          F-measure
Baseline   10.6% (53/500)   98.1% (53/54)   19.1%
Basic      22.2% (6/27)     11.1% (6/54)    14.8%
Expanded   21.6% (8/37)     14.8% (8/54)    17.5%
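The figures in the table follow from the raw counts in parentheses. As a minimal sketch, assuming the F-measure is the balanced harmonic mean of precision and recall, they can be recomputed as follows (the function name is illustrative):

```python
def precision_recall_f(correct, proposed, gold):
    """Precision, recall and balanced F-measure from raw counts.

    correct:  number of proposed interactions that are right
    proposed: number of interactions the system returned
    gold:     number of interactions in the answer key
    """
    precision = correct / proposed if proposed else 0.0
    recall = correct / gold if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


p, r, f = precision_recall_f(correct=6, proposed=27, gold=54)  # Basic run
print(f"{p:.1%} {r:.1%} {f:.1%}")  # 22.2% 11.1% 14.8%
```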

• The baseline system did not achieve 100% recall because some constructs, such as “… A activates or represses B…”, require two interactions between A and B to be recognised.

• Both approaches have low recall but a precision twice that of the baseline system.

• While the performance is low it seems that supplying extra training data improves the performance of our approach.


Analysis

• If we examine the algorithm at each iteration, instead of just the final result, we can see that:
  The seed patterns are unable to extract a single interaction, i.e. the initial F-measure is zero.
  As the seeds do not extract relations, the performance of the system is solely due to the acquired patterns.
  The algorithm is fairly resilient to the acquisition of bad patterns, i.e. with few exceptions, the F-measure steadily increases.


Conclusions

• We used a pattern representation based on dependency trees and an iterative algorithm to learn representative patterns.
  The seed patterns were not well suited to the task, and future work will include experimenting with different seed sets.
  The small amount of training data seems to hinder our approach (adding 78 extra sentences gave a 2.7 percentage point increase in F-measure, from 14.8% to 17.5%).

• The similarity measure we adopted seems well suited to this task, where similar meaning can be conveyed in different ways.


Future Work

• We intend to try dependency parsers other than MINIPAR to see if they are more suited to biomedical texts.

• We are already looking at other pattern representations to see if they are more suited to the task/domain.

• We intend to continue our work on sentence filtering as this would provide a useful first step in any extraction system.

Any Questions?

Copies of these slides can be found at:

http://www.dcs.shef.ac.uk/~mark/nlp/pubs/


Bibliography

• Satanjeev Banerjee and Ted Pedersen. An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet. In Proceedings of the Fourth International Conference on Computational Linguistics and Intelligent Text Processing (CICLING-02), 2002.

• Mark Craven and Johan Kumlien. Constructing Biological Knowledge Bases by Extracting Information from Text Sources. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, 1999.

• Dekang Lin. MINIPAR: a minimalist parser. Maryland Linguistics Colloquium, University of Maryland, College Park, 1999.

• Mark Stevenson and Mark A. Greenwood. A Semantic Approach to IE Pattern Induction. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL-05), 2005.

• Tuangthong Wattarujeekrit, Parantu Shah and Nigel Collier. PASBio: Predicate-Argument Structures for Event Extraction in Molecular Biology. BMC Bioinformatics, 5:155, 2004.