Slides about “Use Cases for Information Extraction with UIMA” for the “Information Management on the Web” course at DIA (Computer Science Department) of Roma Tre University
Information Extraction with UIMA - Use Cases
Information Management on the Web - 2009/2010
Tommaso Teofili
tommaso [at] apache [dot] org
Friday, April 16, 2010
Use Cases - Agenda
UC1 : Real Estate market analysis
UC2 : Tenders automatic information extraction
UC1 : Source
An online classified-ads site for sellers and buyers
General purpose (cars, real estate, hi-fi, etc.)
Local scope (Rome and its surroundings)
UC1 - Goals
Are you looking for houses?
A specified subcategory of the site is dedicated to real estate
I would like to monitor the Rome real estate market to
Make smart decisions
Predict how things will go in the (near) future
UC1 - Source
UC1 - Goals
I want to build a separate web application to monitor such estate listings
I have to use a crawler to automatically download selected pages periodically from the source
Estate listing text is unstructured
I want to make aggregate queries on structured information
UC1 - Information Extraction
I have to write an information extraction engine that populates a relational DB schema with structured information taken from the free text of the estate listings
UC1 - Blocks
UC1 - Crawler
A specialized crawler extracts data from the source
Estate listing data is stored in files, grouped by zone, in a directory on a managed machine
UC1 - Crawler
Navigation of the site is defined in one XML file per city zone
The crawler downloads page fragments twice a week
The free text extracted from the estate listings is saved as XML, grouped by zone
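A per-zone navigation file could look roughly like this; the element names and URL are invented for illustration, since the slides show the real definition only as an image:

```xml
<!-- Hypothetical navigation definition for one city zone;
     the real schema is not shown in the slides. -->
<navigation zone="appia">
  <start url="http://example.org/listings/real-estate/appia"/>
  <follow pattern=".*page=\d+.*"/>    <!-- pagination links to walk -->
  <extract fragment="div.listing"/>   <!-- listing fragments to save -->
</navigation>
```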
UC1 - Crawler Modules
UC1 - Navigation definition
UC1 - Crawler
Issues :
Cookies had to be enabled
Some HTTP headers were needed
Fixed sleep intervals had to be inserted between requests
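A minimal sketch of those workarounds with the JDK HTTP client; the header values, cookie and the 5-second pause are assumptions, since the slides only say that cookies, extra headers and fixed sleeps were required:

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

// Sketch of a "polite" request builder for the crawler.
// Concrete header values and the pause length are assumptions.
public class PoliteFetch {
    static final long SLEEP_MS = 5_000;  // fixed pause between requests

    static HttpRequest buildRequest(String url, String cookie) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", "Mozilla/5.0")  // some sites reject the default agent
                .header("Cookie", cookie)             // session cookie obtained beforehand
                .timeout(Duration.ofSeconds(30))
                .GET()
                .build();
    }

    /** Fixed sleep between two consecutive requests. */
    static void pause() throws InterruptedException {
        Thread.sleep(SLEEP_MS);
    }
}
```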
UC1 - Domain
EstateListing (Announcement)
Zone
MagazineNumber (Uscita, Italian for “issue”)
HouseStructure with properties
UC1 - Information Extraction Engine
Goal : extract price, zone and telephone number
The first version contained a specialized IE engine which used huge regular expressions
Hard to maintain and inefficient
It extracted only a small amount of information
UC1 - IE Engine
New requirement: also extract the structure of the house
Number of rooms, garage (box), garden(s), external spaces, number of bathrooms, kitchen, etc.
Extending the RegEx approach again proved hard to maintain and inefficient
UC1 - IE Engine
Substitute the RegEx-based IE engine with a UIMA-based IE engine to:
exploit previous work (RegExes can live inside UIMA too)
exploit existing components
modify and enhance IE rules easily
be much more efficient
extract more information
UC1 - Analysis pipeline
UC1 - TypeSystem
Crawled XML
Sample text
“ven 26 Dic APPIA via grottaferrata metro 2° piano assolato ingresso salone americana cucina camera cameretta bagno soppalco posto auto e 295.000”
(roughly: “Fri 26 Dec, APPIA, via Grottaferrata, near the metro, 2nd floor, sunny, entrance, living room, open-plan kitchen, bedroom, small bedroom, bathroom, loft, parking space, €295,000”)
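As a rough illustration of the kind of extraction rules involved for this sample, here is a stdlib-only sketch; the zone dictionary and the price pattern are assumptions based on this single listing, not the project's actual rules:

```java
import java.util.*;
import java.util.regex.*;

// Illustrative sketch (not the original engine): pulls price and zone
// out of a listing like the sample above.
public class ListingExtractor {

    // Hypothetical zone dictionary; the real system also learned these
    private static final Set<String> ZONES =
            new HashSet<>(Arrays.asList("APPIA", "TUSCOLANA", "EUR"));

    // "e" followed by a thousands-separated amount, e.g. "e 295.000"
    private static final Pattern PRICE =
            Pattern.compile("\\be\\s?(\\d{1,3}(?:\\.\\d{3})*)\\b");

    /** Returns the price in euros, or -1 if no price pattern matches. */
    public static int extractPrice(String text) {
        Matcher m = PRICE.matcher(text);
        return m.find() ? Integer.parseInt(m.group(1).replace(".", "")) : -1;
    }

    /** Returns the first dictionary zone found in the text, or null. */
    public static String extractZone(String text) {
        for (String token : text.split("\\s+")) {
            String upper = token.toUpperCase(Locale.ITALIAN);
            if (ZONES.contains(upper)) {
                return upper;
            }
        }
        return null;
    }
}
```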
UC1 - ContentAnnotator
From the XML produced by the crawler, only the estate listings must be extracted
A simple parser gets each node containing an estate listing (whose text is in turn unstructured)
A ContentAnnotation is created over the document
UC1 - ContentAnnotator
ContentAnnotation
UC1 - ACAnnotator
UC1 - Entities
ZoneAnnotator - Dictionary & RegEx
ZoneAnnotator - Learning dictionaries
UC1 - ZoneAnnotation
UC1 - Consuming extracted information
The previous version of the IE engine produced (again) XML that had to be parsed in order to store structured data in the DB
With UIMA, a CAS Consumer at the end of the analysis pipeline can automatically store the extracted information in the DB
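Conceptually, the consumer maps each document's annotations to one database row. A stdlib-only sketch of that mapping step follows; the table and column names are invented, and the real consumer would extend UIMA's CasConsumer base class and bind these values through a JDBC PreparedStatement:

```java
// Sketch of the per-document step inside a CAS consumer: the structured
// values already extracted by the annotators become one row in the DB.
// Table/column names are assumptions, not the project's real schema.
public class ListingPersister {

    /** SQL the consumer would prepare once and reuse for every CAS. */
    static String insertSql() {
        return "INSERT INTO estate_listing (zone, price, phone) VALUES (?, ?, ?)";
    }

    /** The values the consumer would bind for a single listing. */
    static Object[] rowFor(String zone, int price, String phone) {
        return new Object[] { zone, price, phone };
    }
}
```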
UC1 - Analyzing real estate market data
a simple webapp written in Java with Spring Framework modules (Spring Core, DAO, JDBC, MVC) running aggregate queries on a MySQL DB
UI enriched with jQuery
UC1 - Analysis Graphs
UC2 - Monitor of tenders/announcements
Monitor various sources that publish announcements and tenders to which interested people and companies can subscribe
We want to automate the long process of monitoring such sources, and also automatically extract useful common information from the announcements’ text
UC2 - Blocks
Different input texts
UC2 - Crawling
Similar to the UC1 crawler, but using a Firefox plugin we can define navigation patterns for the pages of each source
We can also mark metadata seen during navigation that carries useful information
Again, an XML file is generated so that it can be saved in storage and executed periodically
UC2 - Defining navigation
UC2 - Domain annotations
Language
Abstract
Activity
Beneficiary
Budget
Expiration date
Funding type
Geographic region
Sector
Subject
Title
Tags
UC2 - Domain entities
First and most important is an entity that represents the entire tender or announcement
The annotations in the domain ultimately fill that entity’s properties
UC2 - Pipeline
UC2 - Simple first
Each annotator first looks:
for metadata extracted during navigation
for the most common patterns used to express information inside such announcements
e.g.: “Budget: 200000$” or “Financial information: ......”
Such patterns are treated as language independent (although this is often not true)
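A hedged sketch of such a labeled-field rule; the accepted labels and the pattern are illustrative, since the real annotators presumably use richer rules plus the per-source metadata:

```java
import java.util.regex.*;

// Illustrative "label: value" rule for fields such as "Budget: 200000$".
// The label list is an assumption, not the project's real rule set.
public class BudgetPattern {
    private static final Pattern BUDGET = Pattern.compile(
            "(?i)(?:budget|financial information)\\s*:\\s*([\\d.,]+)");

    /** Returns the raw amount string after the label, or null if absent. */
    public static String extractAmount(String text) {
        Matcher m = BUDGET.matcher(text);
        return m.find() ? m.group(1) : null;
    }
}
```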
UC2 - AbstractAnnotator
The abstract is usually in the first part of the document
We use Tokenizer and Tagger to get Tokens (with PoS tags) and Sentences
We use a Dictionary to provide a list of “good” words
We scan the first sentences of the document for the objectives of the announcement (combining good words and regular expressions)
UC2 - ExpirationDateAnnotator
A DateAnnotator is executed beforehand
Iterate over the DateAnnotations
Get the sentences wrapping those DateAnnotations
Check whether terms like “deadline” appear in the same sentence as a DateAnnotation
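The steps above can be sketched with the stdlib alone; the date regex and the keyword list below are placeholders for the real DateAnnotator and term dictionary:

```java
import java.util.*;
import java.util.regex.*;

// Sketch: find dates, then keep only those whose surrounding sentence
// also mentions a deadline-like term. The keyword list and date pattern
// stand in for the real DateAnnotator and dictionary.
public class ExpirationDateFinder {
    private static final Pattern DATE =
            Pattern.compile("\\b\\d{1,2}/\\d{1,2}/\\d{4}\\b");
    private static final List<String> TERMS =
            Arrays.asList("deadline", "expires", "expiration");

    /** Returns dates that share a sentence with a deadline-like term. */
    public static List<String> find(String text) {
        List<String> result = new ArrayList<>();
        // naive sentence split on ., ! or ? followed by whitespace
        for (String sentence : text.split("(?<=[.!?])\\s+")) {
            Matcher m = DATE.matcher(sentence);
            if (m.find()) {
                String lower = sentence.toLowerCase(Locale.ROOT);
                for (String term : TERMS) {
                    if (lower.contains(term)) {
                        result.add(m.group());
                        break;
                    }
                }
            }
        }
        return result;
    }
}
```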
Date patterns
ExpirationDateAnnotator
GeographicRegionAnnotator
UC2 - ActivityAnnotator
Conclusions on IE
UC1 : simple and stable sentence patterns
UC2 : multi-language, with much more complex and varied sentence structures and patterns
Fine-grained metadata are very important
Need to play with NLP
Need to establish good test cases