Extracting Complex Biological Events with Rich Graph-Based Feature Sets
Jari Björne, Juho Heimonen, Filip Ginter, Antti Airola, Tapio Pahikkala, Tapio Salakoski
BioNLP 2009 Workshop

Farzaneh Sarafraz
18 June 2009

BioNLP09 Winners

BioNLP'09 Task 1

Events in abstracts
Given: gene and gene products (proteins)
Wanted: events
  − type
  − trigger
  − participant(s)
  − cause (if applicable)

Example

"I kappa B/MAD-3 masks the nuclear localization signal of NF-kappa B p65 and requires the transactivation domain to inhibit NF-kappa B p65 DNA binding."

Event: negative regulation
Trigger: masks
Theme: the first mention of p65
Cause: MAD-3
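A minimal sketch of how the annotation above could be held in code. The `Event` class and its field names are illustrative assumptions, not the shared task's file format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    """One event: a typed trigger word plus its arguments (hypothetical record)."""
    event_type: str
    trigger: str
    theme: str                   # the participant the event acts on
    cause: Optional[str] = None  # "cause (if applicable)"


# The example sentence, written out as a record.
masking = Event(
    event_type="Negative regulation",
    trigger="masks",
    theme="NF-kappa B p65",
    cause="MAD-3",
)
print(masking)
```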

Event Types

Gene expression
Transcription
Protein catabolism
Localisation
Phosphorylation

Binding
Regulation
Positive regulation
Negative regulation

Training and Test Data

Training data: 800 abstracts
Development data: 150 abstracts
Test data: 260 abstracts

The System

Trigger recognition
  − Methods similar to NER
  − Classification
Argument detection
  − Graph edge selection
  − Classification
Semantic post-processing
  − Rule-based

Trigger Detection

Token labelling (one class for each event type, plus one negative class)
92% of triggers are single tokens
  − Adjacent tokens form a trigger if the combined string appears as a trigger in the training data
Triggers that share a token:
  − Combined class: gene expression/pos regulation
A graph node for each trigger
  − Not duplicated just yet
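The step from token labels to trigger nodes can be sketched as below. This is an illustration only: the function name, the `"neg"` label, the requirement that merged tokens share a label, and the limit to two-token triggers are all assumptions made to keep the example short.

```python
def tokens_to_triggers(tokens, labels, known_triggers):
    """Turn per-token class labels into (trigger text, class) pairs.

    tokens: list of token strings
    labels: predicted class per token, "neg" for the negative class
    known_triggers: set of trigger strings seen in training (hypothetical lookup)
    """
    triggers = []
    i = 0
    while i < len(tokens):
        if labels[i] == "neg":
            i += 1
            continue
        # Merge with the next token only if both carry the same label and
        # the two-token string is a known trigger.
        if i + 1 < len(tokens) and labels[i + 1] == labels[i]:
            pair = f"{tokens[i]} {tokens[i + 1]}"
            if pair in known_triggers:
                triggers.append((pair, labels[i]))
                i += 2
                continue
        triggers.append((tokens[i], labels[i]))
        i += 1
    return triggers


tokens = ["transcriptional", "activation", "of", "IL-2"]
labels = ["Positive_regulation", "Positive_regulation", "neg", "neg"]
print(tokens_to_triggers(tokens, labels, {"transcriptional activation"}))
```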

Classification - SVM

Token features
  − Binary: capitalisation, presence of punctuation or numeric characters
  − Stem
  − Character bigrams and trigrams
  − Token is a known trigger in the training data
  − All of the above for linear and dependency “neighbours”
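A rough sketch of the per-token features listed above. The feature names and the `known_triggers` set are assumptions; the stem feature is left out because it needs an external stemmer, and the neighbour features would reuse the same function on the neighbouring tokens.

```python
import string


def token_features(token, known_triggers):
    """Build a sparse feature dict for one token (illustrative names)."""
    lower = token.lower()
    feats = {
        "capitalised": token[:1].isupper(),
        "has_punct": any(c in string.punctuation for c in token),
        "has_digit": any(c.isdigit() for c in token),
        "known_trigger": lower in known_triggers,
    }
    # Character bigrams and trigrams of the lower-cased token.
    for n in (2, 3):
        for i in range(len(lower) - n + 1):
            feats[f"char{n}={lower[i:i + n]}"] = True
    return feats


print(token_features("Masks", {"masks"}))
```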

Classification - SVM

Frequency features
  − Number of named entities
      In the sentence
      In a linear window around the token
  − Bag-of-words count of token texts in the sentence (?)
Dependency chains
  − Constructed up to a depth of 3 from the token
  − At each depth, both token and frequency features
  − Plus the dependency type and the sequence of dependency types in the chain
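One way to picture the dependency chains: walk outwards from the token and record, for each token reached, the sequence of dependency types that led to it. The graph layout (a dict of `(dep_type, neighbour)` lists), the function name and the dependency labels in the example are assumptions for illustration.

```python
def dependency_chains(graph, start, max_depth=3):
    """Collect (token, dependency-type sequence) pairs up to max_depth steps.

    graph: dict mapping a token to a list of (dep_type, neighbour) pairs.
    """
    chains = []

    def walk(node, types, visited):
        if len(types) == max_depth:
            return
        for dep_type, nxt in graph.get(node, []):
            if nxt in visited:
                continue
            new_types = types + [dep_type]
            chains.append((nxt, tuple(new_types)))
            walk(nxt, new_types, visited | {nxt})

    walk(start, [], {start})
    return chains


# Hand-written toy graph, not real parser output.
graph = {
    "masks": [("nsubj", "MAD-3"), ("dobj", "signal")],
    "signal": [("prep_of", "p65")],
}
print(dependency_chains(graph, "masks"))
```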

Two SVMs

“Somewhat” different feature sets
Combined weighted results

“This design should be considered an artifact of the time-constrained, experiment-driven development of the system rather than a principled design”

Precision/Recall trade-off

Undetected trigger --> undetected event
All triggers have events in the training data --> bias towards reporting an event for all detected triggers
Adjust P/R explicitly
  − multiply the classifier score of the negative class by β
  − find β experimentally
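The β adjustment can be sketched in a few lines. The function names, the `"neg"` label and the score dictionary are assumptions, and the example assumes positive scores (with negative scores the multiplication would push in the opposite direction).

```python
def predict_with_beta(scores, beta, negative="neg"):
    """Pick the best class after scaling the negative-class score by beta.

    scores: dict mapping class name to classifier score.
    """
    adjusted = dict(scores)
    adjusted[negative] = scores[negative] * beta
    return max(adjusted, key=adjusted.get)


def best_beta(candidates, f_score_on_dev):
    """Pick the beta with the highest score from a caller-supplied evaluation."""
    return max(candidates, key=f_score_on_dev)


scores = {"neg": 1.0, "Binding": 0.8}
print(predict_with_beta(scores, beta=1.0))  # negative class wins
print(predict_with_beta(scores, beta=0.7))  # trigger is reported
```

Lowering β below 1 trades precision for recall: more tokens are reported as triggers.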

Edge Detection

Multi-class SVM
All potential directed edges
  − Event node to named entity
  − Event node to event node (nested event)
  − Labelled as theme, cause, or negative

Each edge is predicted independently
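A small sketch of how the candidate edges might be enumerated before classification. The function name and the node lists are assumptions; each returned pair would then be labelled theme, cause, or negative on its own.

```python
from itertools import product


def candidate_edges(event_nodes, entity_nodes):
    """List every potential directed edge that starts at an event node."""
    # Event node to named entity.
    edges = list(product(event_nodes, entity_nodes))
    # Event node to a different event node (nested event).
    edges += [(a, b) for a, b in product(event_nodes, event_nodes) if a != b]
    return edges


for edge in candidate_edges(["masks", "inhibit"], ["MAD-3", "p65"]):
    print(edge)
```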

Feature Set – Central Concept

Shortest undirected path of syntactic dependencies in the Stanford scheme parse of the sentence.
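Finding that path is a plain breadth-first search once the dependencies are treated as undirected. The sketch below assumes the parse is given as `(governor, dep_type, dependent)` triples; the triples in the example are hand-written for illustration, not real parser output.

```python
from collections import deque


def shortest_path(dependencies, source, target):
    """Return the shortest undirected token path, or None if unconnected."""
    adjacency = {}
    for governor, _dep_type, dependent in dependencies:
        adjacency.setdefault(governor, []).append(dependent)
        adjacency.setdefault(dependent, []).append(governor)

    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the back-pointers to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in adjacency.get(node, []):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


deps = [
    ("masks", "nsubj", "MAD-3"),
    ("masks", "dobj", "signal"),
    ("signal", "prep_of", "p65"),
]
print(shortest_path(deps, "MAD-3", "p65"))
```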

Feature Set

Token text, POS, entity/event class, dependency type (e.g. subject)

N-grams: merging the attributes of 2-4
  − Consecutive tokens
  − Consecutive dependencies
  − Each token and two neighbouring dependencies
  − Each dependency and two neighbouring tokens
  − One bigram showing direction
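As an illustration of the first two n-gram kinds, the sketch below merges the attributes of 2-4 consecutive items along a path. The function name and the `_` separator are assumptions; the same function could be applied to a list of token attributes or a list of dependency types.

```python
def path_ngrams(items, min_n=2, max_n=4):
    """Join every run of min_n..max_n consecutive items into one feature string."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(items) - n + 1):
            grams.append("_".join(items[i:i + n]))
    return grams


print(path_ngrams(["MAD-3", "masks", "signal", "p65"]))
```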

Other Features

Individual component features
Semantic node features
Frequency features

Semantic Post-Processing

Duplicate nodes
  − Same class and same trigger
  − Combined trigger
Remove improper arguments
Remove directed cycles by removing the weakest link
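The cycle-breaking rule can be sketched as follows. It assumes that "weakest link" means the edge with the lowest classifier confidence, and that edges arrive as a dict from `(source, target)` to confidence; both are assumptions, as are the function names.

```python
def find_cycle(edges):
    """Return the edges of one directed cycle, or None if the graph is acyclic."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
    state = {}  # node -> "open" while on the DFS stack, "done" afterwards

    def visit(node, stack):
        state[node] = "open"
        stack.append(node)
        for nxt in graph.get(node, []):
            if state.get(nxt) == "open":
                nodes = stack[stack.index(nxt):] + [nxt]
                return list(zip(nodes, nodes[1:]))
            if nxt not in state:
                found = visit(nxt, stack)
                if found:
                    return found
        stack.pop()
        state[node] = "done"
        return None

    for start in list(graph):
        if start not in state:
            found = visit(start, [])
            if found:
                return found
    return None


def break_cycles(confidence):
    """Drop the lowest-confidence edge of each cycle until none remain."""
    kept = dict(confidence)
    while (cycle := find_cycle(kept)) is not None:
        weakest = min(cycle, key=kept.get)
        del kept[weakest]
    return kept


edges = {("A", "B"): 0.9, ("B", "C"): 0.4, ("C", "A"): 0.8, ("A", "D"): 0.7}
print(break_cycles(edges))
```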

Duplicating Event Nodes

Task restrictions
  − Two causes,
  − must have theme,
  − etc.
Several heuristics
x-th first dependency in shortest path from the event for binding

Results

Compared to Us

What Didn't Work/Wasn't Tried

CRF
HMM
Removing strong independence assumption
Co-reference resolution (4.8%)

End.