
Information Extraction

Sources:
• Sarawagi, S. (2008). Information extraction. Foundations and Trends in Databases, 1(3), 261–377.
• Hobbs, J. R., & Riloff, E. (2010). Information extraction. In Handbook of Natural Language Processing (2nd ed.).

CONTEXT

History

• Genesis = recognition of named entities (organization & people names)

• Online access pushes towards
  – personal desktops -> structured databases,
  – scientific publications -> structured records,
  – Internet -> structured fact-finding queries.

Driving workshops / conferences

– 1987-97: MUC (Message Understanding Conference): filling slots, named entities & coreference (95-)

– 1999-08: ACE (Automatic Content Extraction): "supporting various classification, filtering, and selection applications by extracting and representing language content"

– 2008-now: TAC (Text Analysis Conference)
  • Knowledge Base Population (09-11)
  • Others: textual entailment, summarization, QA (until 2009)

Example: MUC

0.  MESSAGE: ID                   TST1-MUC3-0001
1.  MESSAGE: TEMPLATE             1
2.  INCIDENT: DATE                02 FEB 90
3.  INCIDENT: LOCATION            GUATEMALA: SANTO TOMAS (FARM)
4.  INCIDENT: TYPE                ATTACK
5.  INCIDENT: STAGE OF EXECUTION  ACCOMPLISHED
6.  INCIDENT: INSTRUMENT ID       -
7.  INCIDENT: INSTRUMENT TYPE     -
8.  PERP: INCIDENT CATEGORY       TERRORIST ACT
9.  PERP: INDIVIDUAL ID           "GUERRILLA COLUMN" / "GUERRILLAS"
10. PERP: ORGANIZATION ID         "GUATEMALAN NATIONAL REVOLUTIONARY UNITY" / "URNG"
11. PERP: ORGANIZATION CONFIDENCE REPORTED AS FACT / CLAIMED OR ADMITTED: "GUATEMALAN NATIONAL REVOLUTIONARY UNITY" / "URNG"
12. PHYS TGT: ID                  "\"SANTO TOMAS\" PRESIDENTIAL FARM" / "PRESIDENTIAL FARM"
13. PHYS TGT: TYPE                GOVERNMENT OFFICE OR RESIDENCE: "\"SANTO TOMAS\" PRESIDENTIAL FARM" / "PRESIDENTIAL FARM"
14. PHYS TGT: NUMBER              1: "\"SANTO TOMAS\" PRESIDENTIAL FARM" / "PRESIDENTIAL FARM"
15. PHYS TGT: FOREIGN NATION      -
16. PHYS TGT: EFFECT OF INCIDENT  -
17. PHYS TGT: TOTAL NUMBER        -
18. HUM TGT: NAME                 "CEREZO"
19. HUM TGT: DESCRIPTION          "PRESIDENT": "CEREZO"  "CIVILIAN"
20. HUM TGT: TYPE                 GOVERNMENT OFFICIAL: "CEREZO"  CIVILIAN: "CIVILIAN"
21. HUM TGT: NUMBER               1: "CEREZO"  1: "CIVILIAN"
22. HUM TGT: FOREIGN NATION       -
23. HUM TGT: EFFECT OF INCIDENT   NO INJURY: "CEREZO"  DEATH: "CIVILIAN"
24. HUM TGT: TOTAL NUMBER         -

Applications

• Enterprise Applications

  – News Tracking (terrorists, disease)
  – Customer care (linking mails to products, etc.)
  – Data Cleaning
  – Classified Ads

• Personal Information Management (PIM)
• Scientific Applications (e.g. bio-informatics)
• Web Oriented

  – Citation databases
  – Opinion databases
  – Community websites (DBLife, Rexa - UMass)
  – Comparison Shopping
  – Ad Placement on Webpages
  – Structured Web Searches

IE - Taxonomy

• Types of structures extracted
  – Entities, Records, Relationships
  – Open/Closed IE

• Sources
  – Granularity of extraction
  – Heterogeneity: machine generated, (semi-)structured, open

• Input resources
  – Structured DB
  – Labelled Unstructured Text
  – Preprocessing (tokenizer, chunker, parser)

Process (I)

• Annotated documents
• Rules hand-crafted by humans (1500 hours!)
• Rules generated by a system
• Rules evaluated by humans

Process (II)

• Annotated documents
• Rules hand-crafted by humans (1500 hours!)
• Rules generated by a system
• Rules learnt

Process (III)

• Annotated documents
• Rules hand-crafted by humans (1500 hours!)
• Rules generated by a system
• Rules learnt
• Models
  – Logic: First Order Logic
  – Sequence: e.g. HMM
  – Classifiers: e.g. MEM, CRF
• Decomposition into a series of subproblems
  – Complex words, basic phrases, complex phrases, events and merging

Process (IV)

• Annotated documents
• Relevant & non-relevant documents
• Rules hand-crafted by humans (1500 hours!)
• Rules generated by a system
• Rules learnt
• Models
  – Logic: First Order Logic
  – Sequence: e.g. HMM
  – Classifiers: e.g. MEM, CRF

Process (V)

• Annotated documents
• Relevant & non-relevant documents
• Seeds -> bootstrapping
• Rules hand-crafted by humans (1500 hours!)
• Rules generated by a system
• Rules learnt
• Models
  – Logic: First Order Logic
  – Sequence: e.g. HMM
  – Classifiers: e.g. MEM, CRF

RECOGNIZING ENTITIES / FILLING SLOTS

Rule-based systems

• Rules to mark an entity (or more)
  – Before the start of the entity
  – Tokens of the entity
  – After the end of the entity

• Rules to mark the boundaries
• Conflicts between rules
  – Larger span
  – Merge (if same action)
  – Order the rules

Entity Extraction – rule based
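The worked example that accompanied this slide is not reproduced here. As a substitute, a minimal Python sketch of the idea above, assuming two hypothetical regex rules (one keyed on context before the entity, one on context after) and resolving conflicts by preferring the larger, earlier span:

import re

# Illustrative rules only: each pattern captures the entity tokens in the
# named group "ent"; the surrounding context is part of the rule.
RULES = [
    ("PERSON", re.compile(r"(?:Mr\.|Mrs\.|Dr\.)\s+(?P<ent>[A-Z][a-z]+(?:\s[A-Z][a-z]+)*)")),
    ("ORG",    re.compile(r"(?P<ent>[A-Z][A-Za-z&]+(?:\s[A-Z][A-Za-z&]+)*)\s+(?:Inc\.|Corp\.|Ltd\.)")),
]

def extract(text):
    spans = []
    for label, pattern in RULES:
        for m in pattern.finditer(text):
            spans.append((m.start("ent"), m.end("ent"), label))
    # Simplified conflict resolution: sort by start position, then by
    # decreasing length, and keep only non-overlapping spans.
    spans.sort(key=lambda s: (s[0], -(s[1] - s[0])))
    kept = []
    for s in spans:
        if not kept or s[0] >= kept[-1][1]:
            kept.append(s)
    return [(text[a:b], label) for a, b, label in kept]

print(extract("Dr. John Smith met executives of Acme Widgets Inc. yesterday."))
# [('John Smith', 'PERSON'), ('Acme Widgets', 'ORG')]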

Learning rules

• Algorithms are based on (see the sketch below)
  – Coverage [how many cases are covered by the rule]
  – Precision

• Two approaches
  – Top-down (e.g. FOIL): start with coverage = 100%
  – Bottom-up: start with precision = 100%
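To make the two quantities concrete, a minimal sketch (with made-up examples and a hypothetical rule) of how the coverage and precision of a single rule would be measured:

def coverage_and_precision(rule, examples):
    """examples: list of (instance, is_positive); rule: predicate on instance."""
    fired = [(x, pos) for x, pos in examples if rule(x)]
    positives = sum(1 for _, pos in examples if pos)
    covered = sum(1 for _, pos in fired if pos)
    coverage = covered / positives if positives else 0.0   # positives the rule reaches
    precision = covered / len(fired) if fired else 0.0     # firings that are correct
    return coverage, precision

# Hypothetical rule: "token is capitalized" as a predictor of being a name.
examples = [("Paris", True), ("paris", True), ("Bank", False), ("runs", False)]
print(coverage_and_precision(str.istitle, examples))   # (0.5, 0.5)

A top-down learner starts from a rule that fires everywhere (coverage 1.0) and specializes it; a bottom-up learner starts from a rule matching one example (precision 1.0) and generalizes it.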

Rules – AutoSlog

• Rule learning
  – Look at sentences containing targets
  – Heuristic: look for a linguistic pattern

Riloff, E. (1993). Automatically constructing a dictionary for information extraction tasks. In Proceedings of the 11th National Conference on Artificial Intelligence (AAAI-93), 811–816.

Rules – LIEP

Huffman, S. B. (1996). Learning information extraction patterns from examples.

Learns (sets of) meta-heuristics by using syntactic paths that relate two role-filling constituents, e.g. [subject(Bob, named), object(named, CEO)]. This is followed by generalization (matching + disjunction).
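A rough illustration, not the actual LIEP algorithm: assuming a pattern is represented as a list of (relation, arg1, arg2) triples, generalizing two patterns by disjunction could look like this:

def generalize(p1, p2):
    """Merge two patterns position by position; where they differ,
    keep the set of alternatives (a disjunction)."""
    merged = []
    for (r1, a1, b1), (r2, a2, b2) in zip(p1, p2):
        merged.append(({r1, r2}, {a1, a2}, {b1, b2}))
    return merged

p1 = [("subject", "Bob", "named"), ("object", "named", "CEO")]
p2 = [("subject", "Alice", "appointed"), ("object", "appointed", "president")]
print(generalize(p1, p2))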

Statistical models

• How to label
  – IOB sequences (Inside, Outside, Beginning) – see the decoding sketch below
  – Sequences
  – Segmentation

Alleged/B guerrilla/I urban/I commandos/I launched/O two/B highpower/I bombs/I against/O a/B car/I dealership/I in/O downtown/O San/B Salvador/I this/B morning/I.

  – Grammar based (longer dependencies)
• Many ML models:
  – HMM
  – ME, CRF
  – SVM
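For concreteness, a minimal sketch of decoding an IOB-tagged sequence (the plain B/I/O tags of the example above, without entity types) back into segments:

def decode_iob(tagged):
    """tagged: list of (token, tag) with tag in {"B", "I", "O"}."""
    segments, current = [], []
    for token, tag in tagged:
        if tag == "B":                    # a new segment starts here
            if current:
                segments.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:      # continue the open segment
            current.append(token)
        else:                             # O (or a stray I): close any open segment
            if current:
                segments.append(" ".join(current))
            current = []
    if current:
        segments.append(" ".join(current))
    return segments

sentence = [("Alleged", "B"), ("guerrilla", "I"), ("urban", "I"), ("commandos", "I"),
            ("launched", "O"), ("two", "B"), ("highpower", "I"), ("bombs", "I")]
print(decode_iob(sentence))   # ['Alleged guerrilla urban commandos', 'two highpower bombs']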

Statistical models (cont’d)

• Features
  – Word
  – Orthographic
  – Dictionary
  – …

• First order
  – Position:
  – Segment:

Examples of features
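The feature table from the original slide is not reproduced here. Below is a minimal sketch of the kinds of features just listed (word, orthographic, dictionary, plus one first-order context feature); the feature names and the tiny gazetteer are illustrative assumptions:

LOCATION_GAZETTEER = {"salvador", "guatemala", "managua"}

def token_features(tokens, i):
    w = tokens[i]
    return {
        "word": w.lower(),                                      # word feature
        "is_capitalized": w[:1].isupper(),                      # orthographic features
        "is_all_caps": w.isupper(),
        "has_digit": any(c.isdigit() for c in w),
        "prefix3": w[:3].lower(),
        "in_location_dict": w.lower() in LOCATION_GAZETTEER,    # dictionary feature
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>", # first-order context
    }

tokens = "launched two bombs in downtown San Salvador".split()
print(token_features(tokens, 6))   # features for the token "Salvador"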

Statistical models (cont’d)

• Learning:
  – Likelihood
  – Max-Margin
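As a reminder (not spelled out on the slide), for a feature-based model with weight vector w and feature map f(x, y), the two criteria can be written as a regularized conditional log-likelihood and a structured max-margin (hinge) objective:

\[
\hat{w}_{\text{lik}} = \arg\max_{w} \sum_i \Big( w^\top f(x_i, y_i) - \log \sum_{y} \exp\big(w^\top f(x_i, y)\big) \Big) - \lambda \lVert w \rVert^2
\]

\[
\hat{w}_{\text{mm}} = \arg\min_{w} \tfrac{1}{2}\lVert w \rVert^2 + C \sum_i \max_{y} \Big( \Delta(y_i, y) + w^\top f(x_i, y) - w^\top f(x_i, y_i) \Big)
\]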

PREDICTING RELATIONSHIPS

Overall

• Goal: classify (E1, E2, x)
• Features
  – Surface tokens (words, entities)
    [Entity label of E1 = Person, Entity label of E2 = Location]
  – Parse tree (syntactic, dependency graph)
    [POS = (noun, verb, noun), flag = "(1, none, 2)", type = "dependency"]

Models

• Standard classifier (e.g. SVM) – see the sketch below
• Kernel-based methods
  – e.g. measure of common properties between two paths in the dependency tree
  – Convolution-based kernels
• Rule-based methods
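A minimal sketch of the "standard classifier" route, assuming scikit-learn and a toy feature set (entity labels of E1 and E2 plus the words between them); the training pairs and relation labels are made up for illustration:

from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def features(e1_label, e2_label, between_words):
    f = {"e1_label": e1_label, "e2_label": e2_label}
    for w in between_words:
        f["between=" + w.lower()] = 1
    return f

train = [
    (features("Person", "Location", ["lives", "in"]), "lives_in"),
    (features("Person", "Organization", ["works", "for"]), "works_for"),
    (features("Person", "Location", ["moved", "to"]), "lives_in"),
    (features("Person", "Organization", ["joined"]), "works_for"),
]

vec = DictVectorizer()
X = vec.fit_transform([f for f, _ in train])
y = [label for _, label in train]
clf = LinearSVC().fit(X, y)

test = features("Person", "Location", ["resides", "in"])
print(clf.predict(vec.transform([test])))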

Extracting entities for a set of relationships

• Three steps (a sketch follows below)
  – Learn extraction patterns for the seeds
    • Find documents where entities appear close to each other
    • Filtering
  – Generate candidate triplets
    • Pattern or keyword-based
  – Validation
    • # of occurrences
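A minimal end-to-end sketch of the three steps, under strong simplifying assumptions (a three-sentence corpus, single-word entities, and patterns taken to be the literal word context between the two seed entities):

import re
from collections import Counter

corpus = [
    "Paris is the capital of France.",
    "Tokyo is the capital of Japan.",
    "Berlin , capital of Germany , hosted the summit.",
]
seeds = {("Paris", "France")}

# Step 1: learn extraction patterns from sentences where a seed pair co-occurs.
patterns = set()
for e1, e2 in seeds:
    for sent in corpus:
        m = re.search(re.escape(e1) + r"\s+(.{1,40}?)\s+" + re.escape(e2), sent)
        if m:
            patterns.add(m.group(1))          # e.g. "is the capital of"

# Step 2: generate candidate pairs by applying the patterns to the whole corpus.
candidates = Counter()
for p in patterns:
    for sent in corpus:
        for m in re.finditer(r"(\w+)\s+" + re.escape(p) + r"\s+(\w+)", sent):
            candidates[(m.group(1), m.group(2))] += 1

# Step 3: validate candidates, e.g. by number of occurrences.
accepted = {pair for pair, n in candidates.items() if n >= 1}
print(patterns, accepted)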

MANAGEMENT

Summary

• Performance
  – Document selection: subset, crawling
  – Queries to DB: related entities (top-k retrieval)

• Handling changes
  – Detecting when a page has changed

• Integration
  – Detecting duplicate entities (see the sketch below)
  – Redundant extractions (open IE)
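One simple instance of the integration problem is grouping duplicate entity mentions by normalization; a minimal sketch, where the normalization rules are assumptions:

def normalize(name):
    name = name.lower().strip().rstrip(".")
    for noise in (" inc", " corp", " ltd"):        # drop common company suffixes
        if name.endswith(noise):
            name = name[: -len(noise)]
    return " ".join(name.split())

mentions = ["Acme Inc.", "ACME Inc", "Acme", "Beta Corp."]
groups = {}
for m in mentions:
    groups.setdefault(normalize(m), []).append(m)
print(groups)   # mentions sharing a normalized key are treated as duplicates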

EVALUATION

Metrics

• Metrics
  – Precision-Recall
  – F-measure (-> harmonic mean)
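For reference, the standard definitions behind these metrics (P = precision, R = recall):

\[
P = \frac{\text{correct extractions}}{\text{extractions returned}}, \qquad
R = \frac{\text{correct extractions}}{\text{extractions in the gold standard}}, \qquad
F_1 = \frac{2PR}{P + R}
\]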

The 60% barrier
