
Information Extraction

Sources:
• Sarawagi, S. (2008). Information extraction. Foundations and Trends in Databases, 1(3), 261–377.
• Hobbs, J. R., & Riloff, E. (2010). Information extraction. In Handbook of Natural Language Processing (2nd ed.).

CONTEXT

History

• Genesis = recognition of named entities (organization & people names)

• Online access pushes towards
  – personal desktops -> structured databases,
  – scientific publications -> structured records,
  – Internet -> structured fact-finding queries.

Driving workshops / conferences

– 1987-97: MUC (Message Understanding Conference): filling slots, named entities & coreference (95-)

– 1999-08: ACE (Automatic Content Extraction): "supporting various classification, filtering, and selection applications by extracting and representing language content"

– 2008-now: TAC (Text Analysis Conference)
  • Knowledge Base Population (09-11)
  • Others: textual entailment, summarization, QA (until 2009)

Example: MUC

0.  MESSAGE: ID                   TST1-MUC3-0001
1.  MESSAGE: TEMPLATE             1
2.  INCIDENT: DATE                02 FEB 90
3.  INCIDENT: LOCATION            GUATEMALA: SANTO TOMAS (FARM)
4.  INCIDENT: TYPE                ATTACK
5.  INCIDENT: STAGE OF EXECUTION  ACCOMPLISHED
6.  INCIDENT: INSTRUMENT ID       -
7.  INCIDENT: INSTRUMENT TYPE     -
8.  PERP: INCIDENT CATEGORY       TERRORIST ACT
9.  PERP: INDIVIDUAL ID           "GUERRILLA COLUMN" / "GUERRILLAS"
10. PERP: ORGANIZATION ID         "GUATEMALAN NATIONAL REVOLUTIONARY UNITY" / "URNG"
11. PERP: ORGANIZATION CONFIDENCE REPORTED AS FACT / CLAIMED OR ADMITTED: "GUATEMALAN NATIONAL REVOLUTIONARY UNITY" / "URNG"
12. PHYS TGT: ID                  "\"SANTO TOMAS\" PRESIDENTIAL FARM" / "PRESIDENTIAL FARM"
13. PHYS TGT: TYPE                GOVERNMENT OFFICE OR RESIDENCE: "\"SANTO TOMAS\" PRESIDENTIAL FARM" / "PRESIDENTIAL FARM"
14. PHYS TGT: NUMBER              1: "\"SANTO TOMAS\" PRESIDENTIAL FARM" / "PRESIDENTIAL FARM"
15. PHYS TGT: FOREIGN NATION      -
16. PHYS TGT: EFFECT OF INCIDENT  -
17. PHYS TGT: TOTAL NUMBER        -
18. HUM TGT: NAME                 "CEREZO"
19. HUM TGT: DESCRIPTION          "PRESIDENT": "CEREZO"  "CIVILIAN"
20. HUM TGT: TYPE                 GOVERNMENT OFFICIAL: "CEREZO"  CIVILIAN: "CIVILIAN"
21. HUM TGT: NUMBER               1: "CEREZO"  1: "CIVILIAN"
22. HUM TGT: FOREIGN NATION       -
23. HUM TGT: EFFECT OF INCIDENT   NO INJURY: "CEREZO"  DEATH: "CIVILIAN"
24. HUM TGT: TOTAL NUMBER         -

Applications

• Enterprise Applications

  – News Tracking (terrorists, disease)
  – Customer care (linking mails to products, etc.)
  – Data Cleaning
  – Classified Ads

• Personal Information Management (PIM)
• Scientific Applications (e.g. bio-informatics)
• Web Oriented

  – Citation databases
  – Opinion databases
  – Community websites (DBLife, Rexa - UMass)
  – Comparison Shopping
  – Ad Placement on Webpages
  – Structured Web Searches

IE - Taxonomy

• Types of structures extracted
  – Entities, Records, Relationships
  – Open/Closed IE

• Sources
  – Granularity of extraction
  – Heterogeneity: machine generated, (semi-)structured, open

• Input resources
  – Structured DB
  – Labelled Unstructured Text
  – Preprocessing (tokenizer, chunker, parser)

Process (I)

• Annotated documents
• Rules hand-crafted by humans (1500 hours!)
• Rules generated by a system
• Rules evaluated by humans

Process (II)

• Annotated documents
• Rules hand-crafted by humans (1500 hours!)
• Rules generated by a system
• Rules learnt

Process (III)

• Annotated documents
• Rules hand-crafted by humans (1500 hours!)
• Rules generated by a system
• Rules learnt
• Models
  – Logic: First Order Logic
  – Sequence: e.g. HMM
  – Classifiers: e.g. MEM, CRF
• Decomposition into a series of subproblems
  – Complex words, basic phrases, complex phrases, events and merging

Process (IV)

• Annotated documents
• Relevant & non-relevant documents
• Rules hand-crafted by humans (1500 hours!)
• Rules generated by a system
• Rules learnt
• Models
  – Logic: First Order Logic
  – Sequence: e.g. HMM
  – Classifiers: e.g. MEM, CRF

Process (V)

• Annotated documents
• Relevant & non-relevant documents
• Seeds -> bootstrapping
• Rules hand-crafted by humans (1500 hours!)
• Rules generated by a system
• Rules learnt
• Models
  – Logic: First Order Logic
  – Sequence: e.g. HMM
  – Classifiers: e.g. MEM, CRF

RECOGNIZING ENTITIES / FILLING SLOTS

Rule-based systems

• Rules to mark an entity (or more)
  – Before the start of the entity
  – Tokens of the entity
  – After the end of the entity

• Rules to mark the boundaries
• Conflicts between rules
  – Larger span
  – Merge (if same action)
  – Order the rules

Entity Extraction – rule based
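The worked example that accompanied this slide is not reproduced here. As a substitute, a minimal Python sketch of the idea above, assuming two hypothetical regex rules (one keyed on context before the entity, one on context after) and resolving conflicts by preferring the larger, earlier span:

import re

# Illustrative rules only: each pattern captures the entity tokens in the
# named group "ent"; the surrounding context is part of the rule.
RULES = [
    ("PERSON", re.compile(r"(?:Mr\.|Mrs\.|Dr\.)\s+(?P<ent>[A-Z][a-z]+(?:\s[A-Z][a-z]+)*)")),
    ("ORG",    re.compile(r"(?P<ent>[A-Z][A-Za-z&]+(?:\s[A-Z][A-Za-z&]+)*)\s+(?:Inc\.|Corp\.|Ltd\.)")),
]

def extract(text):
    spans = []
    for label, pattern in RULES:
        for m in pattern.finditer(text):
            spans.append((m.start("ent"), m.end("ent"), label))
    # Simplified conflict resolution: sort by start position, then by
    # decreasing length, and keep only non-overlapping spans.
    spans.sort(key=lambda s: (s[0], -(s[1] - s[0])))
    kept = []
    for s in spans:
        if not kept or s[0] >= kept[-1][1]:
            kept.append(s)
    return [(text[a:b], label) for a, b, label in kept]

print(extract("Dr. John Smith met executives of Acme Widgets Inc. yesterday."))
# [('John Smith', 'PERSON'), ('Acme Widgets', 'ORG')]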

Learning rules

• Algorithms are based on (see the sketch below)
  – Coverage [how many cases are covered by the rule]
  – Precision

• Two approaches
  – Top-down (e.g. FOIL): start with coverage = 100%
  – Bottom-up: start with precision = 100%
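To make the two quantities concrete, a minimal sketch (with made-up examples and a hypothetical rule) of how the coverage and precision of a single rule would be measured:

def coverage_and_precision(rule, examples):
    """examples: list of (instance, is_positive); rule: predicate on instance."""
    fired = [(x, pos) for x, pos in examples if rule(x)]
    positives = sum(1 for _, pos in examples if pos)
    covered = sum(1 for _, pos in fired if pos)
    coverage = covered / positives if positives else 0.0   # positives the rule reaches
    precision = covered / len(fired) if fired else 0.0     # firings that are correct
    return coverage, precision

# Hypothetical rule: "token is capitalized" as a predictor of being a name.
examples = [("Paris", True), ("paris", True), ("Bank", False), ("runs", False)]
print(coverage_and_precision(str.istitle, examples))   # (0.5, 0.5)

A top-down learner starts from a rule that fires everywhere (coverage 1.0) and specializes it; a bottom-up learner starts from a rule matching one example (precision 1.0) and generalizes it.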

Rules – AutoSlog

• Rule learning
  – Look at sentences containing targets
  – Heuristic: look for a linguistic pattern

Riloff, E. (1993). Automatically constructing a dictionary for information extraction tasks. In Proceedings of the 11th National Conference on Artificial Intelligence (AAAI-93), 811–816.

Rules – LIEP

Huffman, S. B. (1996). Learning information extraction patterns from examples.

Learns (sets of) meta-heuristics by using syntactic paths that relate two role-filling constituents, e.g. [subject(Bob, named), object(named, CEO)]. This is followed by generalization (matching + disjunction).
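A rough illustration, not the actual LIEP algorithm: assuming a pattern is represented as a list of (relation, arg1, arg2) triples, generalizing two patterns by disjunction could look like this:

def generalize(p1, p2):
    """Merge two patterns position by position; where they differ,
    keep the set of alternatives (a disjunction)."""
    merged = []
    for (r1, a1, b1), (r2, a2, b2) in zip(p1, p2):
        merged.append(({r1, r2}, {a1, a2}, {b1, b2}))
    return merged

p1 = [("subject", "Bob", "named"), ("object", "named", "CEO")]
p2 = [("subject", "Alice", "appointed"), ("object", "appointed", "president")]
print(generalize(p1, p2))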

Statistical models

• How to label
  – IOB sequences (Inside, Outside, Beginning) – see the decoding sketch below
  – Sequences
  – Segmentation

Alleged/B guerrilla/I urban/I commandos/I launched/O two/B highpower/I bombs/I against/O a/B car/I dealership/I in/O downtown/O San/B Salvador/I this/B morning/I.

  – Grammar based (longer dependencies)
• Many ML models:
  – HMM
  – ME, CRF
  – SVM
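For concreteness, a minimal sketch of decoding an IOB-tagged sequence (the plain B/I/O tags of the example above, without entity types) back into segments:

def decode_iob(tagged):
    """tagged: list of (token, tag) with tag in {"B", "I", "O"}."""
    segments, current = [], []
    for token, tag in tagged:
        if tag == "B":                    # a new segment starts here
            if current:
                segments.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:      # continue the open segment
            current.append(token)
        else:                             # O (or a stray I): close any open segment
            if current:
                segments.append(" ".join(current))
            current = []
    if current:
        segments.append(" ".join(current))
    return segments

sentence = [("Alleged", "B"), ("guerrilla", "I"), ("urban", "I"), ("commandos", "I"),
            ("launched", "O"), ("two", "B"), ("highpower", "I"), ("bombs", "I")]
print(decode_iob(sentence))   # ['Alleged guerrilla urban commandos', 'two highpower bombs']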

Statistical models (cont’d)

• Features
  – Word
  – Orthographic
  – Dictionary
  – …

• First order
  – Position:
  – Segment:

Examples of features
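The feature table from the original slide is not reproduced here. Below is a minimal sketch of the kinds of features just listed (word, orthographic, dictionary, plus one first-order context feature); the feature names and the tiny gazetteer are illustrative assumptions:

LOCATION_GAZETTEER = {"salvador", "guatemala", "managua"}

def token_features(tokens, i):
    w = tokens[i]
    return {
        "word": w.lower(),                                      # word feature
        "is_capitalized": w[:1].isupper(),                      # orthographic features
        "is_all_caps": w.isupper(),
        "has_digit": any(c.isdigit() for c in w),
        "prefix3": w[:3].lower(),
        "in_location_dict": w.lower() in LOCATION_GAZETTEER,    # dictionary feature
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>", # first-order context
    }

tokens = "launched two bombs in downtown San Salvador".split()
print(token_features(tokens, 6))   # features for the token "Salvador"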

Statistical models (cont’d)

• Learning:
  – Likelihood
  – Max-Margin
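As a reminder (not spelled out on the slide), for a feature-based model with weight vector w and feature map f(x, y), the two criteria can be written as a regularized conditional log-likelihood and a structured max-margin (hinge) objective:

\[
\hat{w}_{\text{lik}} = \arg\max_{w} \sum_i \Big( w^\top f(x_i, y_i) - \log \sum_{y} \exp\big(w^\top f(x_i, y)\big) \Big) - \lambda \lVert w \rVert^2
\]

\[
\hat{w}_{\text{mm}} = \arg\min_{w} \tfrac{1}{2}\lVert w \rVert^2 + C \sum_i \max_{y} \Big( \Delta(y_i, y) + w^\top f(x_i, y) - w^\top f(x_i, y_i) \Big)
\]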

PREDICTING RELATIONSHIPS

Overall

• Goal: classify (E1, E2, x)
• Features
  – Surface tokens (words, entities)
    [Entity label of E1 = Person, Entity label of E2 = Location]
  – Parse tree (syntactic, dependency graph)
    [POS = (noun, verb, noun), flag = "(1, none, 2)", type = "dependency"]

Models

• Standard classifier (e.g. SVM) – see the sketch below
• Kernel-based methods
  – e.g. measure of common properties between two paths in the dependency tree
  – Convolution-based kernels
• Rule-based methods
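A minimal sketch of the "standard classifier" route, assuming scikit-learn and a toy feature set (entity labels of E1 and E2 plus the words between them); the training pairs and relation labels are made up for illustration:

from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def features(e1_label, e2_label, between_words):
    f = {"e1_label": e1_label, "e2_label": e2_label}
    for w in between_words:
        f["between=" + w.lower()] = 1
    return f

train = [
    (features("Person", "Location", ["lives", "in"]), "lives_in"),
    (features("Person", "Organization", ["works", "for"]), "works_for"),
    (features("Person", "Location", ["moved", "to"]), "lives_in"),
    (features("Person", "Organization", ["joined"]), "works_for"),
]

vec = DictVectorizer()
X = vec.fit_transform([f for f, _ in train])
y = [label for _, label in train]
clf = LinearSVC().fit(X, y)

test = features("Person", "Location", ["resides", "in"])
print(clf.predict(vec.transform([test])))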

Extracting entities for a set of relationships

• Three steps (a sketch follows below)
  – Learn extraction patterns for the seeds
    • Find documents where entities appear close to each other
    • Filtering
  – Generate candidate triplets
    • Pattern or keyword-based
  – Validation
    • # of occurrences
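A minimal end-to-end sketch of the three steps, under strong simplifying assumptions (a three-sentence corpus, single-word entities, and patterns taken to be the literal word context between the two seed entities):

import re
from collections import Counter

corpus = [
    "Paris is the capital of France.",
    "Tokyo is the capital of Japan.",
    "Berlin , capital of Germany , hosted the summit.",
]
seeds = {("Paris", "France")}

# Step 1: learn extraction patterns from sentences where a seed pair co-occurs.
patterns = set()
for e1, e2 in seeds:
    for sent in corpus:
        m = re.search(re.escape(e1) + r"\s+(.{1,40}?)\s+" + re.escape(e2), sent)
        if m:
            patterns.add(m.group(1))          # e.g. "is the capital of"

# Step 2: generate candidate pairs by applying the patterns to the whole corpus.
candidates = Counter()
for p in patterns:
    for sent in corpus:
        for m in re.finditer(r"(\w+)\s+" + re.escape(p) + r"\s+(\w+)", sent):
            candidates[(m.group(1), m.group(2))] += 1

# Step 3: validate candidates, e.g. by number of occurrences.
accepted = {pair for pair, n in candidates.items() if n >= 1}
print(patterns, accepted)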

MANAGEMENT

Summary

• Performance
  – Document selection: subset, crawling
  – Queries to DB: related entities (top-k retrieval)

• Handling changes
  – Detecting when a page has changed

• Integration
  – Detecting duplicate entities (see the sketch below)
  – Redundant extractions (open IE)
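One simple instance of the integration problem is grouping duplicate entity mentions by normalization; a minimal sketch, where the normalization rules are assumptions:

def normalize(name):
    name = name.lower().strip().rstrip(".")
    for noise in (" inc", " corp", " ltd"):        # drop common company suffixes
        if name.endswith(noise):
            name = name[: -len(noise)]
    return " ".join(name.split())

mentions = ["Acme Inc.", "ACME Inc", "Acme", "Beta Corp."]
groups = {}
for m in mentions:
    groups.setdefault(normalize(m), []).append(m)
print(groups)   # mentions sharing a normalized key are treated as duplicates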

EVALUATION

Metrics

• Metrics
  – Precision-Recall
  – F-measure (-> harmonic mean)
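For reference, the standard definitions behind these metrics (P = precision, R = recall):

\[
P = \frac{\text{correct extractions}}{\text{extractions returned}}, \qquad
R = \frac{\text{correct extractions}}{\text{extractions in the gold standard}}, \qquad
F_1 = \frac{2PR}{P + R}
\]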

The 60% barrier
