Information Extraction with UIMA - Use Cases. Information Management on the Web (Gestione delle Informazioni su Web) - 2009/2010. Tommaso Teofili, tommaso [at] apache [dot] org. Friday, 16 April 2010


Slides on "Use Cases for Information Extraction with UIMA" for the "Information Management on the Web" course at DIA (Computer Science Department), Roma Tre University


Page 1: Information Extraction with UIMA - Usecases

Information Extraction with UIMA - Use Cases

Information Management on the Web (Gestione delle Informazioni su Web) - 2009/2010

Tommaso Teofili

tommaso [at] apache [dot] org

Friday, 16 April 2010

Page 2: Information Extraction with UIMA - Usecases

Use Cases - Agenda

UC1: Real estate market analysis

UC2: Automatic information extraction from tenders

Page 3: Information Extraction with UIMA - Usecases

UC1 - Source

An online classified-ads site for sellers and buyers

General purpose (cars, real estate, hi-fi, etc.)

Local scope (Rome and nearby)

Page 4: Information Extraction with UIMA - Usecases

UC1 - Goals

Are you looking for houses?

A specific subcategory of the site is dedicated to real estate

I would like to monitor the Rome real estate market to:

Make smart decisions

Predict how things will go in the (near) future

Page 5: Information Extraction with UIMA - Usecases

UC1 - Source

Page 6: Information Extraction with UIMA - Usecases

UC1 - Goals

I want to build a separate web application to monitor such estate listings

I have to use a crawler to automatically download selected pages periodically from the source

Estate listing text is unstructured

I want to run aggregate queries on structured information

Page 7: Information Extraction with UIMA - Usecases

UC1 - Information Extraction

I have to write an information extraction engine that populates a relational database with structured information extracted from the free text of estate listings

Page 8: Information Extraction with UIMA - Usecases

UC1 - Blocks

Page 9: Information Extraction with UIMA - Usecases

UC1 - Crawler

A specialized crawler extracts data from the source

Estate listing data is stored in files, grouped by zone, in a directory on a managed machine

Page 10: Information Extraction with UIMA - Usecases

UC1 - Crawler

Site navigation is defined with one XML file per city zone

The crawler downloads page fragments twice a week

The free text extracted from estate listings is saved as XML, grouped by zone

Page 11: Information Extraction with UIMA - Usecases

UC1 - Crawler Modules

Page 12: Information Extraction with UIMA - Usecases

UC1 - Navigation definition

Page 13: Information Extraction with UIMA - Usecases

UC1 - Crawler

Issues:

Cookies must be enabled

Some HTTP headers are required

Fixed sleep intervals are needed between requests
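The three crawler issues can be handled in a few lines. The following is a minimal Python sketch, not the crawler used in the project: the header values, the two-second default delay, and the helper names `build_opener_with_cookies` and `crawl` are assumptions made for illustration.

```python
import time
import urllib.request
from http.cookiejar import CookieJar

# Hypothetical header values; a real site may require different ones.
DEFAULT_HEADERS = [
    ("User-Agent", "Mozilla/5.0 (compatible; example-crawler)"),
    ("Accept-Language", "it-IT,it;q=0.8"),
]


def build_opener_with_cookies(headers=DEFAULT_HEADERS):
    """Return an opener that keeps cookies across requests and sends fixed headers."""
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar())
    )
    opener.addheaders = list(headers)
    return opener


def crawl(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Fetch each URL in order, pausing a fixed interval between requests."""
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # fixed pause, skipped before the first request
        pages[url] = fetch(url)
    return pages
```

A caller would typically create one opener and pass `lambda url: opener.open(url).read()` as `fetch`. Keeping `fetch` and `sleep` injectable makes the pacing logic testable without touching the network.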

Page 14: Information Extraction with UIMA - Usecases

UC1 - Domain

EstateListing (Announcement)

Zone

MagazineNumber (Uscita, i.e. issue)

HouseStructure with properties

Page 15: Information Extraction with UIMA - Usecases

UC1 - Information Extraction Engine

Goal: extract price, zone and telephone number

The first version contained a specialized IE engine based on huge regular expressions

Hard to maintain and inefficient

Extracted relatively little information
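To give a feel for the regular-expression approach, here is a small Python sketch that pulls a price and a phone number out of a listing. It is not the original engine: the patterns assume prices written like "€ 295.000" and phone numbers like "06 1234567" or "347 1234567", and the function names are hypothetical.

```python
import re

# Assumed formats: prices such as "€ 295.000" or "euro 295.000" (dot as
# thousands separator); phones such as "06 1234567", "06/1234567", "347 1234567".
PRICE_RE = re.compile(r"(?:€|euro)\s*(\d{1,3}(?:\.\d{3})+)", re.IGNORECASE)
PHONE_RE = re.compile(r"\b(0\d{1,3}|3\d{2})[\s/.-]?(\d{6,8})\b")


def extract_price(text):
    """Return the first price as an integer number of euros, or None."""
    match = PRICE_RE.search(text)
    if match is None:
        return None
    return int(match.group(1).replace(".", ""))


def extract_phone(text):
    """Return the first phone number as a digit string, or None."""
    match = PHONE_RE.search(text)
    if match is None:
        return None
    return match.group(1) + match.group(2)
```

Every additional field or spelling variant needs another pattern of this kind, which is where the maintenance burden comes from.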

Page 16: Information Extraction with UIMA - Usecases

UC1 - IE Engine

New requirement: also extract the structure of the house

Number of rooms, garage (box), garden(s), external spaces, number of bathrooms, kitchen, etc.

Using RegEx again proved hard to maintain and inefficient

Page 17: Information Extraction with UIMA - Usecases

UC1 - IE Engine

Substitute the RegEx-based IE engine with a UIMA-based IE engine to:

exploit previous work (RegExs can live inside UIMA too)

exploit existing components

be able to modify and enhance IE rules easily

be much more efficient

extract more information

Page 18: Information Extraction with UIMA - Usecases

UC1 - Analysis pipeline

Page 19: Information Extraction with UIMA - Usecases

UC1 - TypeSystem

Page 20: Information Extraction with UIMA - Usecases

Crawled XML

Page 21: Information Extraction with UIMA - Usecases

Sample text

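“ven 26 Dic APPIA via grottaferrata metro 2° piano assolato ingresso salone americana cucina camera cameretta bagno soppalco posto auto € 295.000”

(Roughly: “Fri 26 Dec, APPIA, via Grottaferrata, metro, 2nd floor, sunny, entrance hall, living room, American-style kitchen, bedroom, small bedroom, bathroom, mezzanine, parking space, € 295,000”)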

Page 22: Information Extraction with UIMA - Usecases

UC1 - ContentAnnotator

Only the estate listings must be extracted from the XML produced by the crawler

A simple parser retrieves each node containing an estate listing (whose text is itself unstructured)

It creates a ContentAnnotation over the document
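The idea behind the ContentAnnotator can be sketched in plain Python. This is not UIMA code: the `listing` element name, the `ContentAnnotation` dataclass and the `annotate_content` function are assumptions for illustration. The one property borrowed from UIMA is that an annotation is a pair of begin/end offsets into the document text.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class ContentAnnotation:
    begin: int  # offset of the first character of the listing
    end: int    # offset just past the last character


def annotate_content(xml_text, listing_tag="listing"):
    """Return (document_text, annotations) for every listing node in the XML."""
    root = ET.fromstring(xml_text)
    parts, annotations, offset = [], [], 0
    for node in root.iter(listing_tag):
        text = (node.text or "").strip()
        if not text:
            continue  # ignore empty listing nodes
        annotations.append(ContentAnnotation(offset, offset + len(text)))
        parts.append(text)
        offset += len(text) + 1  # +1 for the newline that joins the parts
    return "\n".join(parts), annotations
```

Downstream annotators can then restrict their work to the spans covered by a `ContentAnnotation` instead of re-parsing the XML.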

Page 23: Information Extraction with UIMA - Usecases

UC1 - ContentAnnotator

Page 24: Information Extraction with UIMA - Usecases

ContentAnnotation

Page 25: Information Extraction with UIMA - Usecases

UC1 - ACAnnotator

Page 26: Information Extraction with UIMA - Usecases

UC1 - Entities

Page 27: Information Extraction with UIMA - Usecases

ZoneAnnotator - Dictionary & RegEx

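A combined dictionary and RegEx lookup might look like the following Python sketch. It is only an illustration of the technique named in the slide title: the zone list, the street pattern and the `annotate_zones` function are hypothetical and not taken from the actual ZoneAnnotator.

```python
import re
from dataclasses import dataclass

# Hypothetical dictionary; a real one would list every zone used by the source.
ZONE_DICTIONARY = {"APPIA", "PRATI", "EUR", "TRASTEVERE"}

# Assumed pattern: a street marker followed by a single word.
STREET_RE = re.compile(r"\b(?:via|viale|piazza)\s+\w+", re.IGNORECASE)


@dataclass
class Annotation:
    kind: str
    begin: int
    end: int


def annotate_zones(text):
    """Annotate dictionary zone names and RegEx street mentions, in text order."""
    annotations = []
    for match in re.finditer(r"\w+", text):
        if match.group().upper() in ZONE_DICTIONARY:  # dictionary lookup
            annotations.append(Annotation("Zone", match.start(), match.end()))
    for match in STREET_RE.finditer(text):  # pattern lookup
        annotations.append(Annotation("Street", match.start(), match.end()))
    return sorted(annotations, key=lambda a: a.begin)
```

The dictionary covers known names exactly, while the pattern catches street references that no dictionary could enumerate in advance.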

Page 28: Information Extraction with UIMA - Usecases

ZoneAnnotator - Learning dictionaries


Page 29: Information Extraction with UIMA - Usecases

UC1 - ZoneAnnotation

Page 30: Information Extraction with UIMA - Usecases

UC1 - Consuming extracted information

The previous version of the IE engine produced (again) XML files that had to be parsed to store structured data in the DB

With UIMA, a CAS Consumer at the end of the analysis pipeline can automatically write the extracted information to the DB
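The consumer step amounts to turning each analyzed listing into a database row. The Python sketch below shows that step under stated assumptions: it uses an in-memory SQLite database as a stand-in for the project's DB, and the `estate_listing` table, its columns and the `consume` function are hypothetical names, not the real schema or the UIMA CAS Consumer API.

```python
import sqlite3


def create_schema(connection):
    """Create the hypothetical target table if it does not exist yet."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS estate_listing ("
        "id INTEGER PRIMARY KEY, zone TEXT, price INTEGER, phone TEXT)"
    )


def consume(connection, extracted):
    """Write one row per analyzed listing; missing fields are stored as NULL."""
    rows = [
        (item.get("zone"), item.get("price"), item.get("phone"))
        for item in extracted
    ]
    with connection:  # commits on success, rolls back on error
        connection.executemany(
            "INSERT INTO estate_listing (zone, price, phone) VALUES (?, ?, ?)",
            rows,
        )
    return len(rows)
```

Writing straight from the analysis results removes the intermediate XML file and the second parsing pass that the earlier engine required.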

Page 31: Information Extraction with UIMA - Usecases

UC1 - Analyzing real estate market data

A simple web app written in Java with Spring Framework modules (Spring Core, DAO, JDBC, MVC) that queries aggregate data in a MySQL DB

UI enriched with jQuery
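An aggregate query of the kind such a web app could run is sketched below in Python. The table `estate_listing(zone, price)` and the function name are assumptions for illustration, and SQLite's `sqlite3` module stands in for the MySQL access layer; the SQL itself uses only standard `GROUP BY` aggregation.

```python
import sqlite3


def average_price_by_zone(connection):
    """Return (zone, number_of_listings, average_price) tuples, sorted by zone."""
    query = (
        "SELECT zone, COUNT(*) AS listings, AVG(price) AS avg_price "
        "FROM estate_listing GROUP BY zone ORDER BY zone"
    )
    return connection.execute(query).fetchall()
```

Results in this shape map directly onto a per-zone bar chart or, with a date column added to the grouping, onto a price trend over time.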

Page 32: Information Extraction with UIMA - Usecases

UC1 - Analysis Graphs

Page 33: Information Extraction with UIMA - Usecases

UC1 - Analysis Graphs

Page 34: Information Extraction with UIMA - Usecases

UC2 - Monitor of tenders/announcements

Monitor various sources that publish announcements and tenders to which interested people and companies can subscribe

We want to automate the lengthy process of monitoring such sources and automatically extract useful common information from the announcements’ text

Page 35: Information Extraction with UIMA - Usecases

UC2 - Blocks

Page 36: Information Extraction with UIMA - Usecases

Different input texts

Page 37: Information Extraction with UIMA - Usecases

Different input texts

Page 38: Information Extraction with UIMA - Usecases

Different input texts

Page 39: Information Extraction with UIMA - Usecases

Different input texts

Page 40: Information Extraction with UIMA - Usecases

UC2 - Crawling

Similar to the UC1 crawler, but navigation patterns for the pages of each source are defined using a Firefox plugin

We can also mark metadata seen during navigation that carries useful information

Again, an XML file is generated so that the navigation can be saved to storage and executed periodically

Page 41: Information Extraction with UIMA - Usecases

UC2 - Defining navigation

Page 42: Information Extraction with UIMA - Usecases

UC2 - Domain annotations

Language

Abstract

Activity

Beneficiary

Budget

Expiration date

Funding type

Geographic region

Sector

Subject

Title

Tags


Page 43: Information Extraction with UIMA - Usecases

UC2 - Domain entities

The first and most important is an entity representing the entire tender or announcement

Domain annotations will ultimately fill this entity's properties

Page 44: Information Extraction with UIMA - Usecases

UC2 - Pipeline

Page 45: Information Extraction with UIMA - Usecases

UC2 - Simple first

Each annotator first checks:

whether some metadata was extracted during navigation

for the most common pattern used to state the information inside such announcements

e.g. “Budget: 200000$” or “Financial information: ......”

Such patterns are assumed to be language independent (although this often does not hold)
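The "simple first" strategy can be sketched as a two-step lookup in Python. This is an illustration, not the project's annotator code: the `extract_field` function, its parameters and the shape of the metadata dictionary are assumptions.

```python
import re


def extract_field(labels, text, metadata=None):
    """Return a value from navigation metadata if present, else from a 'Label: value' line."""
    metadata = metadata or {}
    # Step 1: prefer metadata captured while navigating the source.
    for label in labels:
        if label in metadata:
            return metadata[label]
    # Step 2: fall back to the common "Label: value" pattern in the text.
    alternatives = "|".join(re.escape(label) for label in labels)
    pattern = re.compile(
        rf"^\s*(?:{alternatives})\s*:\s*(.+?)\s*$",
        re.IGNORECASE | re.MULTILINE,
    )
    match = pattern.search(text)
    return match.group(1) if match else None
```

Only when both steps return nothing does an annotator need to fall back on heavier NLP, which keeps the common case cheap. The list of labels is where language dependence creeps in, since each language needs its own label spellings.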

Page 46: Information Extraction with UIMA - Usecases

UC2 - AbstractAnnotator

The abstract is usually in the first part of the document

We use a Tokenizer and a Tagger to get Tokens (with PoS tags) and Sentences

We use a Dictionary to provide a list of “good” words

We search the first sentences of the document for the objectives of the announcement (combining good words and regular expressions)
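A rough Python sketch of this heuristic follows. It replaces the Tokenizer and Tagger with a naive sentence splitter, and the word list, the pattern, the scoring weights and the function names are all assumptions chosen for illustration rather than the rules used in the project.

```python
import re

# Hypothetical "good" words hinting at the objective of an announcement.
GOOD_WORDS = {"aim", "aims", "objective", "objectives", "purpose", "support", "supports"}
OBJECTIVE_RE = re.compile(r"\b(?:aims? to|in order to|the purpose of)\b", re.IGNORECASE)


def split_sentences(text):
    """Naive splitter standing in for a real Tokenizer/Tagger."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_abstract(text, max_sentences=3):
    """Return the best-scoring sentence among the first ones, or None."""
    best, best_score = None, 0
    for sentence in split_sentences(text)[:max_sentences]:
        words = {w.lower() for w in re.findall(r"\w+", sentence)}
        # One point per good word, two extra points for a pattern match.
        score = len(words & GOOD_WORDS) + (2 if OBJECTIVE_RE.search(sentence) else 0)
        if score > best_score:
            best, best_score = sentence, score
    return best
```

Limiting the search to the first few sentences encodes the observation that the abstract usually sits at the start of the document.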

Page 47: Information Extraction with UIMA - Usecases

UC2 - ExpirationDateAnnotator

A DateAnnotator is executed first

Iterate over DateAnnotations

Get the sentences containing those DateAnnotations

Check whether terms like “deadline” appear in the same sentence as a DateAnnotation
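These steps can be sketched in Python as follows. The sketch is an approximation under stated assumptions: dates are limited to the forms "30 June 2010" and "30/06/2010", sentences are split naively on punctuation, and the term list and function names are hypothetical.

```python
import re

# Assumed date formats for the sketch: "30 June 2010" or "30/06/2010".
DATE_RE = re.compile(
    r"\b\d{1,2}/\d{1,2}/\d{4}\b"
    r"|\b\d{1,2} (?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{4}\b"
)
# Hypothetical trigger terms.
DEADLINE_TERMS = ("deadline", "expires", "no later than", "closing date")


def sentence_spans(text):
    """Yield (begin, end) offsets of naive sentences ending in '.', '!' or '?'."""
    for match in re.finditer(r"[^.!?]+[.!?]?", text):
        yield match.start(), match.end()


def find_expiration_dates(text):
    """Return dates whose enclosing sentence contains a deadline term."""
    spans = list(sentence_spans(text))
    results = []
    for date in DATE_RE.finditer(text):  # plays the role of the DateAnnotations
        for begin, end in spans:
            if begin <= date.start() and date.end() <= end:
                sentence = text[begin:end].lower()
                if any(term in sentence for term in DEADLINE_TERMS):
                    results.append(date.group())
                break  # a date belongs to exactly one sentence
    return results
```

Checking the enclosing sentence is what separates an expiration date from other dates in the same document, such as the publication date.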

Page 48: Information Extraction with UIMA - Usecases

Date patterns

Page 49: Information Extraction with UIMA - Usecases

ExpirationDateAnnotator

Page 50: Information Extraction with UIMA - Usecases

GeographicRegionAnnotator

Page 51: Information Extraction with UIMA - Usecases

UC2 - ActivityAnnotator

Page 52: Information Extraction with UIMA - Usecases

UC2 - ActivityAnnotator

Page 53: Information Extraction with UIMA - Usecases

Conclusions on IE

UC1: simple and stable sentence patterns

UC2: multilingual, with much more complex and varied sentence structures and patterns

Fine-grained metadata is very important

Need to experiment with NLP

Need to establish good test cases