Learning to Extract Relations
Learning to Extract Relations for Protein Annotation
Jee-Hyub Kim (a), Alex Mitchell (b,c), Teresa K. Attwood (b,c), Melanie Hilario (a)
(a) University of Geneva
(b) University of Manchester
(c) European Bioinformatics Institute
ISMB/ECCB2007, 23 July 2007
Contents
Introduction
Related Work
Problem and Approach
Methods
Experimental Results
Conclusion
Introduction
Protein Annotation
- Definition
  - Given a protein sequence or name, describe the protein with all relevant information.
- Two main methods
  - Sequence analysis method (e.g., BLAST, CLUSTALW, etc.)
  - Literature analysis method
- Traditionally done manually by human annotators; automation is needed.
- Text mining has been used for automation.
- Two main tasks in text mining
  - Information retrieval (IR): to retrieve relevant documents
  - Information extraction (IE): to extract certain pieces of information from text for pre-defined entities and their relations
Adaptive Information Extraction
- IE is domain-specific.
  - Generally, a system developed for one domain cannot be reused for a new domain.
- Developing IE systems requires a significant amount of domain knowledge.
  - Developing IE rules
  - Defining relations
  - These are the two main bottlenecks.
- Many new domains still need to develop their own IE systems.
  - e.g., cell cycle, tissue specificity, etc.
Related Work
IE System Development: Developing IE Rules
- Knowledge engineering (KE) based approach
  - Knowledge engineers write hand-crafted rules with the help of domain experts (e.g., biologists).
  - Not scalable
- Machine learning (ML) based approach
  - To increase the robustness and coverage of IE rules
  - ML has been used to learn IE rules.
  - Annotated corpora, pre-labelled corpora, raw corpora.
- Labor required to develop IE systems
  - KE > ML (annotated corpora > pre-labelled corpora > raw corpora)
IE System Development: Defining Relations
- All previous IE systems assume that the relations to be extracted are already defined.
- It is hard to specify precisely all possible relations to extract, especially in complex and dynamically evolving domains (e.g., the biological domain).
- Positioning (ML-based approach)

| Corpora \ Relations | Pre-defined | Not defined |
|---|---|---|
| Annotated | Soderland (1999), Freitag (2000), Califf and Mooney (2003) | |
| Pre-labeled | Riloff (1996) | Our work |
| Raw | Hasegawa et al. (2004) | Collier (1996) |
Problem and Approach
Our Problem
- Goal
  - To alleviate the burden of developing IE systems for biologists
- Problem definition
  - Given relevant sentences that describe protein X in terms of any topic Y, and irrelevant sentences
  - Learn to extract relations for protein annotation
- Two sub-problems
  - What to extract
  - How to extract
Bottom-up Approach
Figure: From sentences to relations
Methods
System Architecture
Figure: Overall IE system architecture
Analyzing Sentences: MBSP
- Memory-Based Shallow Parser (MBSP)
  - Developed by Walter Daelemans
  - Extended with named entity taggers and an SVO relation finder
- Provides various types of information: part-of-speech (POS) tags, subject-verb-object (SVO) relations, named entities (NE), etc.
- Adapted to the biological domain on the basis of the GENIA corpus
  - 97.6% accuracy on POS tagging
  - 71.0% accuracy on protein named entity recognition
Analyzing Sentences: Example
- Example
  - INPUT: Examples of this are the RNA-binding protein containing the RNA-binding domain (RBD) ...
  - OUTPUT:

| Chunk | Syntactic | Semantic | SVO relation |
|---|---|---|---|
| Examples | noun phrase | | subject of 'are' |
| of | preposition | | |
| this | noun phrase | | |
| are | verb phrase | | |
| the RNA-binding protein | noun phrase | protein | subject of 'contain' |
| containing | verb phrase | | |
| the RNA-binding domain (RBD) | noun phrase | domain | object of 'contain' |
Learning IE Rules: Inductive Logic Programming (ILP)
- We applied ILP to learn IE rules.
  - ILP is an ML algorithm that induces rules from examples.
  - Outputs are readable and interpretable by domain experts.
  - Can deal with relational information (e.g., parse trees).
- Problem setting
  - $B \land H \models E$
  - Given B (background knowledge) and E (examples), find H (hypothesis).
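The problem setting above can be illustrated with a deliberately simplified, propositional sketch in Python. It is not an ILP engine and not the system described in the talk: each example is reduced to a set of feature strings, a rule covers an example when all of its conditions are present, and a hypothesis H is accepted when it covers every positive example and no negative one. The feature strings and the helper names `covers` and `is_complete_and_consistent` are hypothetical, chosen only for illustration.

```python
from typing import FrozenSet, List

Rule = FrozenSet[str]      # conjunction of required features
Example = FrozenSet[str]   # features observed in one analyzed sentence


def covers(rule: Rule, example: Example) -> bool:
    """A rule covers an example when every condition of the rule holds in it."""
    return rule <= example


def is_complete_and_consistent(hypothesis: List[Rule],
                               positives: List[Example],
                               negatives: List[Example]) -> bool:
    """Complete: every positive is covered. Consistent: no negative is covered."""
    complete = all(any(covers(r, e) for r in hypothesis) for e in positives)
    consistent = not any(covers(r, e) for r in hypothesis for e in negatives)
    return complete and consistent


# Invented toy data (assumption): features stand in for analyzed sentences.
positives = [frozenset({"vp:contain", "dobj:domain", "subj:protein"}),
             frozenset({"vp:contain", "dobj:domain"})]
negatives = [frozenset({"vp:contain", "dobj:buffer"})]

good_h = [frozenset({"vp:contain", "dobj:domain"})]
too_general_h = [frozenset({"vp:contain"})]

print(is_complete_and_consistent(good_h, positives, negatives))         # True
print(is_complete_and_consistent(too_general_h, positives, negatives))  # False
```

The second hypothesis fails because dropping the object condition makes the rule fire on the irrelevant sentence as well.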
Learning IE Rules: Representation
- B (Background Knowledge)
  - Linguistic heuristics (for single-slot IE patterns)
    - e.g., <subj> verb, verb <dobj>, verb preposition <np>, noun prep <np>, etc.
  - Sentence descriptions (i.e., analyzed sentences)
- E (Examples)
  - Positive and negative examples (i.e., relevant and irrelevant sentences)
- H (Hypothesis)
  - A set of IE rules
Learning IE Rules: Generalization
Figure: From a sentence to an IE rule
Learning IE Rules: Step 1
Figure: From a sentence to an IE rule
Learning IE Rules: Step 2
Figure: From a sentence to an IE rule
Learning IE Rules: Step 3
Figure: From a sentence to an IE rule
Learning IE Rules: Step 4
Figure: From a sentence to an IE rule
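$$\mathrm{WRAcc}(\mathit{Rule}) = \mathrm{coverage}(\mathit{Rule}) \times \bigl(\mathrm{accuracy}(\mathit{Rule}) - \mathrm{accuracy}(\mathit{Head} \leftarrow \mathit{true})\bigr)$$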
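A small worked example may help in reading the WRAcc (weighted relative accuracy) score. The sketch below assumes the usual interpretation of the three terms: coverage is the fraction of all sentences on which the rule fires, accuracy(Rule) is the fraction of those that are relevant, and accuracy(Head ← true) is the fraction of relevant sentences in the whole corpus. The function name `wracc` and the counts are invented for illustration and do not come from the talk.

```python
def wracc(covered_pos: int, covered_neg: int, total_pos: int, total_neg: int) -> float:
    """Weighted relative accuracy of a rule, computed from raw example counts."""
    total = total_pos + total_neg
    covered = covered_pos + covered_neg
    if total == 0 or covered == 0:
        return 0.0
    coverage = covered / total              # coverage(Rule)
    accuracy = covered_pos / covered        # accuracy(Rule)
    default_accuracy = total_pos / total    # accuracy(Head <- true)
    return coverage * (accuracy - default_accuracy)


# Invented counts: 100 sentences, 40 relevant; the rule fires on 10, of which 9 are relevant.
score = wracc(covered_pos=9, covered_neg=1, total_pos=40, total_neg=60)
print(round(score, 3))  # 0.05
```

Under this reading, a rule whose accuracy only matches the corpus-wide positive rate scores zero, however many sentences it covers.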
Selecting IE Rules
- Why is this step necessary?
  - Rules are learned to classify sentences, not to extract information.
  - Spurious IE rules are learned in the previous step.
  - These need to be filtered out by domain experts.
- Rules are provided to users together with supporting information.
  - Example
    - RULE: <subj:*> vp:contain & vp:contain <dobj:domain> [9, 0.9]
    - S: Myocilin is a secreted glycoprotein that forms multimers and contains a leucine zipper and an olfactomedin domain.
Transforming Rules
- Once IE rules are selected, they are transformed into relations.
- Mapping between IE rules and relations

| IE Rules | Relations |
|---|---|
| extract | argument |
| trigger | relation name |
| syntactic tag | argument position |

- Example
  - <subj:*>[X] vp:contain & vp:contain <dobj:domain>[Y] → contain(X,Y)
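The mapping above can be sketched in Python as a small string transformation. This is an assumption-laden illustration, not the authors' implementation: the regular expressions are guessed from the rule syntax shown on the slides, the function name `rule_to_relation` is hypothetical, and the argument ordering (subject first, direct object second, any other tag kept as a `tag:Var` label) is an assumed convention.

```python
import re

# Slot syntax as it appears on the slides: <tag:type>[Var]
SLOT = re.compile(r"<(?P<tag>[\w-]+):(?P<type>[^>]+)>\[(?P<var>\w+)\]")
# Trigger syntax as it appears on the slides: vp:verb
TRIGGER = re.compile(r"\bvp:(?P<verb>[\w-]+)")


def rule_to_relation(rule: str) -> str:
    """Turn a selected IE rule into a relation template such as contain(X,Y)."""
    trigger = TRIGGER.search(rule)
    if trigger is None:
        raise ValueError("rule has no vp: trigger")
    args = []
    for slot in SLOT.finditer(rule):
        tag, var = slot.group("tag"), slot.group("var")
        # Assumed ordering: subject first, direct object second, then other tags.
        rank = {"subj": 0, "dobj": 1}.get(tag, 2)
        label = var if rank < 2 else f"{tag}:{var}"
        args.append((rank, label))
    args.sort(key=lambda pair: pair[0])
    return f"{trigger.group('verb')}({','.join(label for _, label in args)})"


print(rule_to_relation("<subj:*>[X] vp:contain & vp:contain <dobj:domain>[Y]"))
# contain(X,Y)
```

The trigger supplies the relation name, each extract slot supplies an argument, and the syntactic tag decides where that argument goes, mirroring the three rows of the mapping table.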
Grouping Rules
- Post-processing rules

| Pattern | Example |
|---|---|
| A verb (active form) B | A activate B |
| B be ed-participle by A | B be activated by A |
| nominal form (with suffix -tion) of verb of B by A | activation of B by A |
| A be nominal form (with suffix -or) of verb of B | A is an activator of B |
| A be ... that verb (active form) B | A is ... that activates B |
Applying Rules to Extract Relations
- Now we have a set of IE rules for each relation.
- IE rule and relation
  - RULE: <subj:protein>[X] vp:promote & vp:promote <dobj:disease>[Y]
  - TRIGGER: promote
  - RELATION: promote(X,Y)
- Example
  - INPUT: Our data demonstrate that PKC beta II promotes colon cancer, at least in part, through induction of Cox-2, suppression of TGF-beta signaling, and establishment of a TGF-beta-resistant, hyperproliferative state in the colonic epithelium.
  - OUTPUT: promote('PKC beta II','colon cancer')
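How such a rule might be matched against an analyzed sentence can be sketched as follows. The `Chunk` class, the `apply_rule` function, and the hand-written chunk list are all hypothetical stand-ins: the real system works on MBSP output, whose exact data format is not shown in the slides.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chunk:
    text: str
    semantic: Optional[str] = None   # e.g. "protein", "disease"
    role: Optional[str] = None       # "subj" or "dobj"
    verb: Optional[str] = None       # lemma of the verb this chunk attaches to


def apply_rule(chunks: List[Chunk], trigger: str,
               subj_type: str, dobj_type: str) -> Optional[str]:
    """Return trigger('X','Y') when a typed subject and direct object of the trigger exist."""
    def find(role: str, semantic: str) -> Optional[Chunk]:
        for chunk in chunks:
            if chunk.role == role and chunk.verb == trigger and chunk.semantic == semantic:
                return chunk
        return None

    subj, dobj = find("subj", subj_type), find("dobj", dobj_type)
    if subj is None or dobj is None:
        return None
    return f"{trigger}('{subj.text}','{dobj.text}')"


# Hand-written stand-in for the parser output on the INPUT sentence above.
sentence = [
    Chunk("Our data", role="subj", verb="demonstrate"),
    Chunk("PKC beta II", semantic="protein", role="subj", verb="promote"),
    Chunk("colon cancer", semantic="disease", role="dobj", verb="promote"),
]

print(apply_rule(sentence, "promote", "protein", "disease"))
# promote('PKC beta II','colon cancer')
```

Requiring both the syntactic role and the semantic type to match is what keeps the subject of "demonstrate" from being extracted as the first argument.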
Experimental Results
Experiments
- Our IE system was applied to the annotation of PRINTS (a protein family database).
- Development corpora

| Topic | Positives | Negatives | Class distribution (positive-negative) |
|---|---|---|---|
| Disease | 777 | 1403 | 36-64% |
| Function | 1268 | 2625 | 33-67% |
| Structure | 1159 | 1750 | 40-60% |

- 80% for training, 20% for testing
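The class distribution column can be reproduced from the sentence counts with a few lines of Python. This is only an arithmetic check; the helper name `class_distribution` is hypothetical, and rounding to whole percentages is an assumption about how the figures were reported.

```python
# Sentence counts (positives, negatives) per topic, taken from the table above.
corpora = {
    "Disease": (777, 1403),
    "Function": (1268, 2625),
    "Structure": (1159, 1750),
}


def class_distribution(positives: int, negatives: int) -> tuple:
    """Return (positive %, negative %) rounded to whole percentages."""
    pos_pct = round(100 * positives / (positives + negatives))
    return pos_pct, 100 - pos_pct


for topic, (pos, neg) in corpora.items():
    pos_pct, neg_pct = class_distribution(pos, neg)
    print(f"{topic}: {pos_pct}-{neg_pct}%")
# Disease: 36-64%
# Function: 33-67%
# Structure: 40-60%
```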
Evaluation
- All the extracted relations were manually evaluated by domain experts.
- Results

| Topic | Learned rules | Selected rules | Relations | Precision (%) | Recall (%) | F1-measure (%) |
|---|---|---|---|---|---|---|
| Disease | 55 | 32 | 21 | 75 | 18.3 | 29.4 |
| Function | 125 | 64 | 23 | 66.3 | 15.1 | 24.6 |
| Structure | 146 | 76 | 20 | 85.3 | 61 | 71.1 |

- cf. a recent work on extracting regulatory gene/protein networks
  - F1-measure of 44% (Saric et al., 2006)
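The F1 column follows from the reported precision and recall through the standard harmonic mean, which the short Python check below reproduces. The function name `f1` is a hypothetical helper; the input numbers are the ones reported in the results table.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in %) per topic, as reported in the results.
results = {
    "Disease": (75.0, 18.3),
    "Function": (66.3, 15.1),
    "Structure": (85.3, 61.0),
}

for topic, (p, r) in results.items():
    print(topic, round(f1(p, r), 1))
# Disease 29.4
# Function 24.6
# Structure 71.1
```

Because the harmonic mean is dominated by the smaller value, the low recall on Disease and Function pulls F1 far below the precision figures.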
Protein Annotation Examples
- Discovered relations

| Topic | Relations |
|---|---|
| Disease | be_associated, is_a, be_mutated, be_caused, be_deleted, contribute, etc. |
| Function | induce, block, mediate, is_a, belong, act, etc. |
| Structure | contain, form, share, lack, bind, encode, be_conserved, etc. |

- Annotation example for protein NF-kappaB
  - be_implicated(,'NF-kappaB', in:'the pathogenesis')
  - regulate('IkappaBalpha', 'NF-kappaB'), activate('BCMA', 'NF-kappaB')
  - be_composed('NF-kappaB', of:'heterodimeric complexes')
Limitations
- Failure to find inverse relation
  - S: Expression of uPAR in tumor extracts also inversely correlates with prognosis in many forms of cancer.
  - RULE: np:expression <of:protein>[X] & vp:correlate <with:prognosis>[Y]
  - RELATION: correlate(expression('expression',of:'uPAR'), with:'prognosis')
- Anaphora problem
  - S: Whereas the overall structure resembles that of the NF-kappaB p50-DNA complex, pronounced differences are observed within the 'insert region'.
  - RULE: <subj:structure>[X] vp:resemble & vp:resemble <dobj:C>[Y]
  - RELATION: resemble('the overall structure','that')
Conclusion
- Proposed a methodology for developing IE systems with resources that can be provided by biologists.
- Learned relations as well as IE rules without annotated corpora.
- Annotated proteins with structured information (i.e., predicate-argument structure) in terms of any topic.
- Validated the methodology over different topics (function, structure, disease, cancer) in the biomedical domain.
- Advantage: will alleviate the burden of developing IE systems for users who have little or no formal IE training.
Thank you for your attention!
Appendix
For Further Reading
- Mary Elaine Califf and Raymond J. Mooney. Bottom-up relational learning of pattern matching rules for information extraction. Journal of Machine Learning Research, 4:177–210, 2003.
- R. Collier. Automatic template creation for information extraction, an overview, 1996.
- Dayne Freitag. Machine learning for information extraction in informal domains. Machine Learning, 39(2/3):169–202, 2000.
- Takaaki Hasegawa, Satoshi Sekine, and Ralph Grishman. Discovering relations among named entities from large corpora. In ACL, pages 415–422, 2004.
- Ellen Riloff. Automatically generating extraction patterns from untagged text. In Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI-96), pages 1044–1049, 1996.
- Jasmin Saric, Lars Juhl Jensen, Peer Bork, Rossitza Ouzounova, and Isabel Rojas. Extracting regulatory gene expression networks from PubMed. In ACL, pages 191–198, 2004.
- Stephen Soderland. Learning information extraction rules for semi-structured and free text. Machine Learning, 34(1-3):233–272, 1999.