Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510...

Preview:

Citation preview

Processing Complex Sentences for Information Extraction

Deepthi Chidambaram

December 22, 2004BY 510

Committee

Dr. Hasan Davulcu

Dr. Chitta Baral

Dr. Yoganand Balagurunathan

Outline

• Information extraction – motivation.• Current Trends in Information Extraction.• Sentence structures.• Complex Sentence Processing

– A Rule Engineering Approach.– An Automated Approach

• Experimental Results.• Conclusion.

Information Extraction - Motivation

• Enormous amount of knowledge stored in unstructured text.

• Lack of easy access to information for processing.• Biomedical domain - Piping hot results that curated

databases don’t offer.

What can we get from text

• Literature– Protein functional information.– Interactions between genes, proteins, chemicals and

drugs.– Signal transduction pathways.

• Clinical records– Diagnosis patterns.– Drugs to disease association.

The information extraction process

Text Retrieval

• Keyword retrieval (PubMed)

• Vector-Space Model

• Probabilistic models

• Latent Semantic Indexing

Text Classification

• Knowledge Engineering– Rules specified by a domain expert.– ‘knowledge acquisition bottleneck’.

• Machine Learning– ‘Hard’ and ‘soft’ classification.– Traditional classifiers with automatic feature selection

algorithms.– Frequent word co-occurrences and/or Natural Language

patterns.– Ontologies to identify word synonyms.

Named Entity Recognition

• Dictionary look up.– LocusLink, Gene Ontology, HUGO, SwissProt and

UMLS provide extensive lists of biomedical terms.– Constrained to identify names present in the

vocabularies alone.

• Regular Expressions.– Patterns defined on words and phrases.– E.g.. Alphanumeric words, hyphenated mixed cases.

Named Entity Recognition

• Boosted Wrapper Induction– Name boundaries to identify entities.

• Supervised Classifiers– Hidden Markov Models.– Support Vector Machines.

• Supervised Rule Learners to tag entity names – the RAPIER system– Rules based on word patterns.

• Probabilistic Rule Learner – ABGene– Rules based on co-occurrence of POS tags and neighboring words.

Named Entity Recognition

Results from Shared task in JNLPBA, 2004

(Trained and tested on GENIA corpus)

 Technique used Precision / Recall / F-measure

Combination of Hidden Markov Models and Support Vector Machines

76.0 / 69.4 / 72.6

Classifier trained on Local and syntactic features of entity names 71.6 / 68.6 / 70.1

Conditional Random Fields (CRF) and Novel Feature Sets 70.3 / 69.3 / 69.8

SVM + CRF 67.8 / 64.8 / 66.3

HMM model 69.1 / 61.0 / 64.8

Training SVM classifiers on linguistic knowledge 67.4 / 61.0 / 64.0

Extracting Relationships

• Word co-occurrences– Presence of a functional word and entity

names in same sentence.

• Regular Expressions – Regular expressions defined on part of speech

tags.– Processing complex sentence structures using

regular expressions.

Extracting Relationships

• Grammar rules based on sentence structures – shallow parses.– Context Free grammars. – Probabilistic Context Free Grammars. – Automatic learning of grammar rules using supervised rule

learners – FOIL, PROGOL.– Combination of Naïve Bayes Classifier and FOIL.

Extracting Relationships

• Hidden Markov Models– Learning templates based on words and sentence structures.

• Preposition based templates – the Arizona Relation parser– Slot filler approach.– Shallow parsing of sentence structure.

• Decision trees - GIS– Wording and term distributions represented as a decision tree to

identify protein interactions.

Extracting relationships

• Manually built templates– SUISEKI - Frame based system.– BioRAT - Templates based on sentence structure and semantics.– GENIES, GeneWays – Extension of the MedLEE frame based

system.

• Full sentence parses– The structure of the entire sentence is considered.– Potentially more accurate.– Rule learners based on Link grammar analysis of sentences.– Dependency sub-graph from the link grammar word-dependency

output.

Evaluation of Extraction Systems

• Precision– Correctness of system

True Positives True Positives + False Positives

• Recall– Sensitivity of system

True PositivesTrue Positive + False Negative

• F-measure– Combination of Precision and Recall (harmonic mean)

2 * P * R(P + R)

Common problems

• Complex sentence structures.c-Abl tyrosine kinase activity is blocked by pRb, which binds to the c-Abl kinase domain.

• Contextual information.c-Abl phosphorylates tyrosines in the C-terminal domain (CTD) of RNA polymerase II (RPase II; Km 5 0.5 mM).

Common problems

• Negation.DNA-PK does not bind detectably to Ku in the absence of DNA.

• Modality and hypothesis.Cdk4 may be inhibited by tyrosine phosphorylation.

Gadd45 binds to Cdk1 and Gadd45 inhibits Cdk1 activity, probably by displacing Cyclin B1.

Complex Sentence Processing - Motivation

• Sentences structures in text are seldom simple.

• Simple, Complex, compound and Complex-compound structures are addressed.

• Dependent and independent clauses.

• Multiple clauses used to specify interactions.

• Multiple interactions specified in clauses.

Sentence Structures - Examples

• SimplePhosphorylation by ATM activates c-Abl.

• Complexc-Abl tyrosine kinase activity is blocked by pRb, which binds to the c-Abl kinase domain.

• Compoundc-Abl phosphorylates tyrosines in the C-terminal domain (CTD) of RNA polymerase II (RPase II; Km 5 0.5 mM) ; the c-Abl SH2 domain is a specificity determinant for this reaction.

• Complex-compoundBecause the same signals that induce somitic expression of Pax-3 also induce myogenesis and because expression of Pax-3 precedes that of MyoD and Myf-5 in vivo and that of MyoD in vitro, we investigated whether Pax-3 might play a role in activating the myogenic program in somitic cell.

Complex Sentence Processing

• Manual Rule Engineering approach– Based on Part of Speech tags.– Domain expert engineers rules for complex sentence

structures.

• Automated approach– Based on syntactic relationships between the words in

a sentence.– Link grammar parser.

Evaluation of Our Systems

• PrecisionNo. of Correct interactions

Total no. of interactions extracted

• RecallNo of correct interactions

Total no of interactions identified in text

• F-measure(2*P*R)/P+R

Rule Engineering Approach

• Patterns learnt by interaction with domain expert.

• Sentences tagged by a natural Language processor.

• Extraction rules written based on Part of speech tags.

• False positives reduced by use of Ontologies.

Rule Engineering Approach

Toufeeq

Rule Engineering Approach

• System developed in XSB prolog.

• A Java GUI integrated to the back-end processing system.

• Dictionaries coded as knowledge bases in prolog.

• Input – Sentences in a text file.

• Ouput – An interaction triplet.– interact (upstream-gene, interaction word,

downstream-gene).

Pos patterns for simple sentences

extract([word([tag = NNP],_h13160),word([tag = VBZ],_h13161), word([tag=JJ],_h13162)], interact(_h13160,_h13161,_h13162),true]).

extract([ng(_h99513),vg(_h99514),ng(_h99515)],

interact(_h99513,_h99514,_h99515),true).

extract([ng(_h108321),vg(_h108322),word([tag=NNP],_h108323)],

interact(_h108321,_h108322,_h108323),true).

Extract an interaction from a (Noun, Verb, Adjective) pattern.

Knowledge Bases Used

• LocusLink (NCBI)– compilation of the genes and related information such as alias names, protein

products, PubMed IDs of articles referring to the gene etc.– Gene, protein names and their aliases.

• UMLS (NLM)– Electronic "Knowledge Sources" and associated lexical programs for biological

domain.– Specialist Lexicon used to extract interaction words.

• WordNet– English nouns, verbs, adjectives and adverbs are organized into synonym sets,

each representing one underlying lexical concept. – Used in conjunction with UMLS to extract interaction words.– Interaction words are stemmed using the Porter Stemming algorithm to match

across numbers and tense.

Representation of Knowledge Bases

%gene and protein namesisa('3''-nucleotidase',gene).isa('3'' repair exonuclease 1',protein).isa('E2F4',gene).isa('5''-nucleotidase, cytosolic IA',protein).isa('M1',gene).isa('PP2A, subunit B''',protein)

% interaction wordsinteraction('accumulat').interaction('activat').interaction('elevat').interaction('hasten').interaction('incite').interaction('increas').interaction('induc').

Complex Sentence Processing

• Extraction Methodology– Identify the Complex Sentence patterns.– Split the complex and compound sentences to simple

sentences based on extraction rules for the pattern.– Process the sub-sentences using pattern matching

against simple sentence patterns.

• An iterative process was used to build the rules and add simple sentence patterns manually.

• Training and Testing – curated text from a 1999 Molecular Biology of the Cell article by Kurtis Kohn.

Rules Written to Address Complex Sentence Structures

Case - List of arguments - a comma separated list or a list separated by conjunctions.

Rule - Sentence is rewritten with each of the arguments. Each rewrite is processed as a single sentence.

Sample Input- Dyhydrofolate reductase (DHFR) is activated via the E2F transactivation domain, whereas B-myb, Cyclin E, E2F-1, E2F-2, and Cdc2 are regulated via the repression domain of pRb family proteins.

Sample output - interact([Dyhydrofolate,reductase],[is,activated],

[the,E2F,transactivation,domain],1.0).interact(B-myb,[are,regulated],[the,repression,domain],0.6667).interact([Cyclin,E],[are,regulated],[the,repression,domain],0.6667).interact(E2F-1,[are,regulated],[the,repression,domain],0.6667).interact(E2F-2,[are,regulated],[the,repression,domain],0.6667).interact(Cdc2,[are,regulated],[the,repression,domain],0.6667).

Rules – Continued

extractListStart([],_).extractListStart([X|Y],Part) :-

conc([X],Part,Whole),extractComplex(Whole),extractListStart(Y,Part).

extractComplex(S):-

contains([W1,word([_],','),_,word([_],','),_],S),splitList([W1],S,S1,S2),S1=[],getList(S2,List),length(List,Len),ith(Len,List,Ele),sublist([Ele],S,Part1,Part2),extractListStart(List,Part2),!.

Snapshot of the system GUI

Sample Run of the system

Results

Training data Testing data Precision (%)

Recall

(%)Parts A, B Parts A, B 98 88

Parts A, B Part C 57 43

Parts A, B, C Part E 81.25 71

Discussion

• Pronoun resolution done by hand.

• Dependent on skills of domain expert.

• Extension to other sentence structures will involve re-engineering rules.

• Recall tends to be higher as rules are manually created.

Automated Extraction System

• Automatic Integrated system for – Pronoun resolution.– Entity Tagging.– Complex sentence processing.– Interaction Extraction.

• Uses dependencies generated by the Link grammar parser

• Scalable• Less error prone

Automated Approach to CSP

Anaphora Resolution

• Anaphora Resolution– Identification of pairs of noun phrases that refer to

each other.

• Need for Anaphora resolution– Entities are often referred to by pronouns and referent

noun phrases.– Identifying the noun phrases referred to improves

recall of the system.

Types of Anaphora in Literature

Pronoun Resolution Module

• Identifies and resolves pronominal anaphora.

• System written in XSB prolog.

• Interprolog Java API is used to integrate the module to the extraction system.

• Addresses third person pronouns such as it, they.

• Handles both singular and plural pronouns as well as reflexives (itself, themselves) and possessives

(its, their).

Pronoun Resolution - Algorithm

1. Process sentences to get POS tags for the words.

2. Perform chunking on the words to form noun phrases.

3. Identify pronouns in the sentence

4. Remove redundant reflexives

(Gene A, by itself remains activated.)

5. Replace identified pronouns with noun phrases matching number and voice.

Pronoun Resolution: Walkthrough

Ku loads onto dsDNA ends and it can diffuse along the DNA in an energy-independent manner.

Ku loads onto dsDNA ends and Ku can diffuse along the DNA in an

energy-independent manner.

When breast cancers were examined for NGAL mRNA and protein levels, they were found to exhibit heterogeneous expression.

When breast cancers were examined for NGAL mRNA and protein levels , breast cancers were found to exhibit heterogeneous expression .

Pronoun Resolution: Example 2

Entity Tagging

• Combination of dictionary approach and heuristic rules.

• Dictionaries built from LocusLink, UMLS, GO and WordNet.

• Entities addressed by dictionaries - genes, proteins, chemicals, location and interaction words.

• Regular expression developed manually using the information from the dictionaries.

• Heuristic rules combine regular expression and POS tags to mark entities.

Gene and Protein Name Dictionary

• Source - NCBI’s LocusLink.• Fields in LocusLink contain information about gene

loci.– Official nomenclature, aliases, sequence accessions, phenotypes, EC numbers,

MIM numbers, UniGene clusters, homology, map locations, related web sites, etc.

• Perl parser written to identify gene and protein names and their aliases.

• Names and aliases maintained in a list loaded by the system during processing.

Gene and Protein Name Dictionary

(a) LocusLink fields considered for extraction (b) Partial LocusLink record (c) Regular expression used for gene name matching

Gene and Protein Name Tagging

• Exact matches with dictionary• Process of relaxation

– Mark the entire noun group as a gene if any of the word has been identified as gene / protein name.

‘The c-abl kinase activity’ is a gene type

• Words that match a manually built regular expression.– POS tags are used to exclude false positives such as adverbs,

modals, prepositions, etc.

• 508,477 entries in dictionary

Chemical and Location Names

• Used to tag location of interaction and agents in interaction for modifier analysis.

• Source - UMLS Metathesaurus and Semantic Network, GO.

• Location names dictionary contains the GO cellular components ontology.

• Semantic types identified from Semantic Network using the ‘isa’ relationship.

Chemical and Location Tagging

• Concepts categorized under the identified semantic types are added to the corresponding dictionaries.

• A perfect match with an entry in the dictionary is tagged as chemical name / cellular location.

• 790,000 chemical names and 118, 018 location names were identified.

Interaction Words

• Source – Verbs from UMLS Specialist Lexicon and WordNet.

• Interaction words are identified as– (Specialist verbs ∩ WordNet verbs)

• Missing interaction words are added manually.

• 1400 verbs in their stemmed form are stored in the list.

• Porter Stemming Algorithm allows matches across tense and number.

Complex Sentence Processor (CSP)

• Tool used – Link Grammar Parser (LGP) from CMU.

• Verb based extraction.– John played the pipes.

• Linkages returned by the link grammar used to identify syntactic constructs.

• Syntactic constructs associated with a verb form simple sentences.

• Internal clause format representation.

• Java system with a GUI.

Link Grammar Parser

• Based on the Link Grammar for English Syntax.

• Dependency based grammar system written in C.

• Every word in a sentence is connected to every other word directly or indirectly through links.

• Dictionary specifies linking requirements for each word.

• Linkages in a sentence should satisfy– Connectivity.– Planarity.

Linking Requirement

• Link Grammar Parser (LGP) trained on sentences from conversational English (70 telephone conversations).

• Linking requirements specified using ‘&’ and ‘or’ operators.

the: ({AL-} & {@L+} & (D+ or DD+)) or DG+ or (TR- & U+);

• Each word should follow its linking requirement.

Linkage

• Words in sentence matched against dictionary– Linkage formed using the satisfied linking requirements.

• Null linkages – ambiguous / unknown words– LG ignores some words if an acceptable guess of the requirement is

not found.

• Linkages returned in increasing order of cost.

Using Link Grammar Parser in CSP

• LGP considers each sentence as a having one verb.

• Provides sub-linkages for each clause in complex sentence.

• Sub linkages are combined to a single linkage using !union option.

• Combined linkage analyzed and split into clauses.

Combined sub-linkages - !union operator

Pre-processing Text for Link Grammar

• Gene names are not in the vocabulary of LG.– Force LGP to identify entities as nouns.– Capitalized and hyphenated phrases with entity tags.– E.g. ‘The C-abl protein kinase activity’ is converted to

‘Gene-The-C-abl-protein-kinase-activity’.

• Phrases / words in parentheses cause incorrect linkages.– Remove words within parentheses.

Preprocessing – removing parentheses

Rapid degradation of Cyclin D1 requires phosphorylation at threonine-286 (kinase unknown, but not Cdk2 or Cdk4); degradation is by way of the ubiquitin-proteasome pathway

Sentence crashes link grammar if processed as is, with parentheses.

Pre-processing Text for Link Grammar

• Windows version of Link Grammar Parser– Does not handle sentences of more than 70 words

• Mark the sentences of more than 70 words and discard them for further processing

– Words of more than 50 characters also crash the LGP• Mark sentences and discard them from further processing

• Sentences without two participants and one interaction word do not contain interactions– Discard sentences to improve processing time– Discarding is done after pronoun resolution so

referrals are not disturbed.

Complex Sentence Processing

• Input– Tagged and pre-processed text.

• Output– Clauses from the sentence representing the simple

sentences.

• Internal clause format representationSubject + Verb + Object (s) + Verb Modifying Phrases (s)

• Simple Sentences processed by Interaction Extractor module to get interactions from text.

Complex Sentence Processor- Goal

Kohn-partC3|2|Another DMP1-regulated gene is CD13/aminopeptidase N, which is activated cooperatively by DMP1 and c-Myb; CD13/aminopeptidase N activation by DMP1 is inhibited by cyclin D, independent of Cdk4/6.

Complex Sentence Processing

Kohn-partC3|2|Gene-Another-DMP1-regulated-gene|is|Gene-CD13/aminopeptidase-N , which#||Kohn-partC3|2|Gene-CD13/aminopeptidase-N |is activated||cooperatively#by Gene-DMP1 and Gene-c-Myb#|Kohn-partC3|2|Gene-N-activation by Gene-DMP1|is inhibited||by Gene-cyclin-D , independent of Gene-Cdk4/6#|

Abstract ID | Sentence ID | Subject | Verb | Objects(s) | Verb Modifier(s)

Our Contribution: CSP Algorithm

1.Processing complex sentences - Set ranges in sentence1. Identify the S links in the sentence. Each S link forms the beginning

of a simple sentence.

2. Identify the Components from each S link and expand the range to include prepositional, adverbial and adjectival phrases.

2.Extract the components of the simple sentence from the original sentence using the ranges as the substring markers.

CSP - Illustration

Upon growth factor stimulation of quiescent cells, p130 declines late in G1 and p130 is replaced by p107, which is absent in quiescent cells.

Kohn-partE2|8|Upon growth factor stimulation of quiescent cells, Gene-p130 declines late in Gene-G1 and Gene-p130 is replaced by Gene-p107, which is absent in quiescent cells. |

Entity Tagging and Pre-processing

Illustration of the CSP algorithm

Kohn-partE2|8| upon growth factor stimulation of quiescent cells, Gene-p130 |declines || late # in Gene-G1 #|

Kohn-partE2|8| Gene-p130 | is replaced || by Gene-p107 #|

Kohn-partE2|8| Gene-p107 | is absent || in quiescent cells#|

CSP Example – Pronoun resolution, preprocessing

Experimental results

• System compared with two other systems– BioRAT [Corney et al, 2004]

• Uses manually generated sentence structure templates.

• Partial parses of sentences.

– GeneWays [Rzhetsky et al, 2004]• An improvement on GENIES system.

• Extraction rules are semantic grammars defined on the sentence structures

• Partial parses of the sentences.

Evaluation - BioRAT

• DIP entries with both participants present in SwissProt.

• 229 abstracts, 660 entries in DIP.

• Gene name dictionary restricted to SwissProt names to emulate the BioRAT system.

• 452 interactions obtained.

• Precision of 54.39%

• Recall of 19.98%.

• F-measure of 29.18%

Comparison with BioRAT

Comparison with BioRAT and DIP was undertaken as a combined work with Toufeeq

Comparison with DIP

Difference in Recall - the interaction is not mentioned in the abstract. (Manual and semi-automatic curation of entries in DIP)

Comparison with GeneWays

• 2500 interactions obtained from the authors corresponding to more than 10 citations in PubMed.

• 30 interactions with most citations were selected and PubMed abstracts (495) obtained through NCBI’s eutilities facility corresponding to the entities.

• Extraction system obtained 1485 interactions.• 156 interactions occurred more than twice in the

abstracts with Precision of 76.13 %.• Recall performed on a single full text article quoted in

GENIES. • 42 interactions extracted from the article with a recall of

63.64%.

Comparison with GeneWays

Analysis of errors in comparison with GeneWays

Conclusion

• Extraction system is able to extract multiple interactions mentioned through complex sentence structure.

• Limitations and bottlenecks– Protein name identification.– Link Grammar Parsing problems.– Identification of relationships between clauses.

Thank you

• My advisors – Dr. Davulcu, Dr. Baral for their guidance and encouragement

• Yoga, for the periodic inputs and pointers• Dr. David Corney , for sharing the data from BioRAT• Dr. Andrew Rzhetsky for providing data from the

GeneWays system• Toufeeq, for sharing his code and for his contribution in

evaluation.

Questions

Recommended