Constraint-based Information Integration
Steven Minton, Fetch Technologies
Joint work with Craig Knoblock and Jose Luis Ambite (USC/ISI)
Example Application
Tiger MapServer
Geocoder
Zagat Restaurants Guide
Integration System
LA County Restaurant Health Ratings
Outline
Agents that access information sources on the web
AgentBuilder: learning from examples
ActiveAtlas: standardizing data from multiple sources
Constraint-based Integration
Heracles: putting it all together
Information Agents
Decision Support, Application Programs
Information Agent
Knowledge Bases, Databases, Computer Programs, The Web
Web Agents
Web agents provide a uniform query language for data access: “Wrapping a web site”
Query: Restaurants in Santa Monica?
Name             Address
Chinois on Main  2709 Main St.
Chao Dara        13 Union Sq.
…                ...
AgentBuilder
Supervised learning: extraction rules created from examples
High precision
High reliability
Extraction technology
Expressive extraction rule language:
  Extraction rule = sequence of landmarks
  Describes how to find the beginning and end of each field
Start: SkipTo(Cuisine :) SkipTo(<b>)
End: SkipTo(</b>)
PAGE: <html> Name:<b> KFC </b> Cuisine :<p> <b> Fast Food </b> <br>...
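The landmark rule above can be illustrated with a minimal sketch. It assumes landmarks are matched as plain substrings and that the end rule scans forward from the start position; the helper names `skip_to` and `extract` are hypothetical and are not AgentBuilder's API.

```python
def skip_to(text, landmarks, start=0):
    """Consume each landmark in order; return the index just past the last one."""
    pos = start
    for landmark in landmarks:
        found = text.find(landmark, pos)
        if found < 0:
            return None  # the rule does not apply to this page
        pos = found + len(landmark)
    return pos


def extract(page, start_rule, end_landmark):
    """Apply the start rule, then scan forward to the end landmark (an assumption)."""
    begin = skip_to(page, start_rule)
    if begin is None:
        return None
    end = page.find(end_landmark, begin)
    if end < 0:
        return None
    return page[begin:end].strip()


PAGE = "<html> Name:<b> KFC </b> Cuisine :<p> <b> Fast Food </b> <br>..."

# Start: SkipTo(Cuisine :) SkipTo(<b>)   End: SkipTo(</b>)
cuisine = extract(PAGE, ["Cuisine :", "<b>"], "</b>")
print(cuisine)  # Fast Food
```

Going through the "Cuisine :" landmark first matters here: a rule that only skipped to the first <b> would land on the Name field instead.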
A Sequential Covering Algorithm for “Wrapper Induction”
Training Examples: Name: Del Taco <p> Phone (toll free) : <b> ( 800 ) 123-4567 </b><p>Cuisine ...
Name: Burger King <p> Phone : ( 310 ) 987-9876 <p> Cuisine: …
Initial candidate: SkipTo( ( )
SkipTo( <b> ( ) ... SkipTo(Phone) SkipTo( ( ) ... SkipTo(:) SkipTo( ( )
… SkipTo(Phone) SkipTo(:) SkipTo( ( ) ...
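The refinement steps on these slides can be sketched as a simplified sequential covering loop. This is an illustration under stated assumptions, not the AgentBuilder implementation: landmarks are plain substrings, a rule is refined only by prepending one landmark from a small hand-picked vocabulary, and the function names are hypothetical.

```python
def apply_rule(text, landmarks):
    """Return the index just past the last landmark, or None if one is missing."""
    pos = 0
    for landmark in landmarks:
        found = text.find(landmark, pos)
        if found < 0:
            return None
        pos = found + len(landmark)
    return pos


def covers(rule, example):
    text, target = example
    return apply_rule(text, rule) == target


def learn_rule(examples, seed, vocabulary, max_landmarks=3):
    """Greedily refine the seed rule by prepending landmarks."""
    def score(rule):
        return sum(covers(rule, ex) for ex in examples)

    best, frontier = seed, [seed]
    while frontier:
        candidate = max(frontier, key=score)
        if score(candidate) > score(best):
            best = candidate
        if score(best) == len(examples) or len(candidate) >= max_landmarks:
            break
        frontier = [[lm] + rule for rule in frontier for lm in vocabulary]
    return best


def sequential_covering(examples, seed, vocabulary):
    """Learn rules until every example is covered (or no progress is made)."""
    rules, remaining = [], list(examples)
    while remaining:
        rule = learn_rule(remaining, seed, vocabulary)
        if not any(covers(rule, ex) for ex in remaining):
            break
        rules.append(rule)
        remaining = [ex for ex in remaining if not covers(rule, ex)]
    return rules


def labeled(text, marker):
    # The target is the position just after the "(" that opens the area code.
    return text, text.index(marker) + 1


EXAMPLES = [
    labeled("Name: Del Taco <p> Phone (toll free) : <b> ( 800 ) 123-4567 </b><p>Cuisine ...", "( 800"),
    labeled("Name: Burger King <p> Phone : ( 310 ) 987-9876 <p> Cuisine: ...", "( 310"),
]

rules = sequential_covering(EXAMPLES, seed=["("], vocabulary=["<b>", "Phone", ":"])
print(rules)  # [['Phone', ':', '(']]
```

The initial candidate SkipTo( ( ) fails on the Del Taco example because it stops at "(toll free)"; the two-landmark refinements each cover only one example, and SkipTo(Phone) SkipTo(:) SkipTo( ( ) is the first candidate that covers both.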
The Problem: Multi-Source Inconsistency
How can the same objects be identified when they are stored in inconsistent text formats?
Restaurant names as listed in the two sources (Zagat’s Restaurant Guide; Health Dept Restaurant Listings):
  Art’s Delicatessen; Ca’ Brea; CPK; The Grill; Patina; Philippe’s The Original; The Tillerman
  Art’s Deli; California Pizza Kitchen; Campanile; Citrus; Grill, The; Philippe The Original; Spago
The Solution: Record Linkage
Zagat’s Restaurants (Zagat’s Agent)
Name                  Street                                 Phone
Art’s Deli            12224 Ventura Boulevard                818-756-4124
Teresa's              80 Montague St.                        718-520-2910
Steakhouse The        128 Fremont St.                        702-382-1600
Les Celebrites        155 W. 58th St.                        212-484-5113
Dept. of Health (Dept. of Health Agent)
Name                  Street                                 Phone
Art’s Delicatessen    12224 Ventura Blvd.                    818/755-4100
Teresa's              103 1st Ave. between 6th and 7th Sts.  212/228-0604
Binion's Coffee Shop  128 Fremont St.                        702/382-1600
Les Celebrites        160 Central Park S                     212/484-5113
Query
Record Linkage
Zagat’s
Name                  Street                                 Phone
Art’s Deli            12224 Ventura Boulevard                818-756-4124
Teresa’s              80 Montague St.                        718-520-2910
Steakhouse The        128 Fremont St.                        702-382-1600
Les Celebrites        155 W. 58th St.                        212-484-5113
Dept of Health
Name                  Street                                 Phone
Art’s Delicatessen    12224 Ventura Blvd.                    818/755-4100
Teresa’s              103 1st Ave. between 6th and 7th Sts.  212/228-0604
Binion’s Coffee Shop  128 Fremont St.                        702/382-1600
Les Celebrites        5432 Sunset Blvd                       212/484-5113
Approach to Record Linkage
Learning attribute weighting rules
Source          Name                Street                   Phone
Zagat’s         Art’s Deli          12224 Ventura Boulevard  818-756-4124
Dept of Health  Art’s Delicatessen  12224 Ventura Blvd.      818/756-4124
Learning general transformation rules
Zagat’s                   Dept of Health           Transformation rule
Art’s Deli                Art’s Delicatessen       Abbreviation
California Pizza Kitchen  CPK                      Acronym
Philippe The Original     Philippe’s The Original  Stemming
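The three transformation types named on this slide can be sketched as simple string tests. This is a hedged illustration only: the definitions below (prefix match for abbreviation, initial letters for acronym, stripping a possessive or plural suffix for stemming) and the function names are assumptions, not the rules Active Atlas actually learns.

```python
def stem(token):
    """Crude stemmer (an assumption): strip a possessive or plural suffix."""
    token = token.lower()
    for suffix in ("'s", "s"):
        if token.endswith(suffix) and len(token) > len(suffix):
            return token[: -len(suffix)]
    return token


def acronym_of(name):
    return "".join(word[0] for word in name.split()).lower()


def token_relation(a, b):
    a_low, b_low = a.lower(), b.lower()
    if a_low == b_low:
        return "Equal"
    if stem(a) == stem(b):
        return "Stemming"
    short, long_ = sorted((a_low, b_low), key=len)
    if long_.startswith(short):
        return "Abbreviation"
    return None


def name_relations(name_a, name_b):
    """Return the transformation types that relate two names, or [] if none do."""
    if acronym_of(name_a) == name_b.lower() or acronym_of(name_b) == name_a.lower():
        return ["Acronym"]
    tokens_a, tokens_b = name_a.split(), name_b.split()
    if len(tokens_a) != len(tokens_b):
        return []
    relations = [token_relation(x, y) for x, y in zip(tokens_a, tokens_b)]
    if None in relations:
        return []
    return sorted(set(relations) - {"Equal"})


print(name_relations("Art's Deli", "Art's Delicatessen"))               # ['Abbreviation']
print(name_relations("California Pizza Kitchen", "CPK"))                # ['Acronym']
print(name_relations("Philippe The Original", "Philippe's The Original"))  # ['Stemming']
```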
Active Learning to Determine Matched Records [Tejada, Knoblock, Minton ’01, ’02]
Learn importance of attributes for matching records
Source          Name                Street                   Phone
Zagat’s         Art’s Deli          12224 Ventura Boulevard  818-756-4124
Dept of Health  Art’s Delicatessen  12224 Ventura Blvd.      818/755-4100
Mapping rules:
Name > .9 & Street > .87 => mapped
Name > .95 & Phone > .96 => mapped
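A minimal sketch of how such mapping rules might be applied: each rule is a conjunction of per-attribute similarity thresholds, and a pair is mapped if any rule fires. The thresholds come from the slide; the similarity scores in the example are made-up values, and how the scores are computed is outside this sketch.

```python
# Each rule is a conjunction: every listed attribute must exceed its threshold.
MAPPING_RULES = [
    {"Name": 0.9, "Street": 0.87},
    {"Name": 0.95, "Phone": 0.96},
]


def is_mapped(scores, rules=MAPPING_RULES):
    """A record pair is mapped if at least one rule is satisfied."""
    return any(
        all(scores.get(attribute, 0.0) > threshold for attribute, threshold in rule.items())
        for rule in rules
    )


# Hypothetical similarity scores for one candidate pair of records.
pair_scores = {"Name": 0.92, "Street": 0.90, "Phone": 0.60}
print(is_mapped(pair_scores))  # True (first rule fires)
```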
Active Atlas Mapping Rule Learner
[Diagram: choose initial examples; generate a committee of learners; each learner learns rules and classifies examples; the learners' votes are used to choose an example; the USER labels it; the output is the set of mapped objects]
Committee Disagreement
Chooses an example based on the disagreement of the query committee
CPK, California Pizza Kitchen is the most informative example
Examples                        Committee
                                M1   M2   M3
Art’s Deli, Art’s Delicatessen  Yes  Yes  Yes
CPK, California Pizza Kitchen   Yes  No   Yes
Ca’Brea, La Brea Bakery         No   No   No
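Choosing the example with the most committee disagreement can be sketched as follows. The vote table assumes the Yes/No entries on this slide are read row by row (one row per example, one column per committee member M1 to M3), and the disagreement measure used here (size of the minority vote) is one simple choice among several.

```python
VOTES = {
    "Art's Deli, Art's Delicatessen": ["Yes", "Yes", "Yes"],
    "CPK, California Pizza Kitchen": ["Yes", "No", "Yes"],
    "Ca'Brea, La Brea Bakery": ["No", "No", "No"],
}


def disagreement(votes):
    """Size of the minority vote: 0 means the committee is unanimous."""
    yes = votes.count("Yes")
    return min(yes, len(votes) - yes)


def most_informative(vote_table):
    """Pick the example on which the committee disagrees the most."""
    return max(vote_table, key=lambda example: disagreement(vote_table[example]))


print(most_informative(VOTES))  # CPK, California Pizza Kitchen
```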
Constraint-based Integration
Integrating data from multiple sources often involves reasoning about the information
Constraints provide an approach to expressing relationships and filtering data
Heracles
Framework for building integrated applications
Interleaves planning and information gathering
Uses a constraint reasoner to decide what sources to query and to integrate the results
The Travel Assistant
Dynamically Updates Slots as Information Becomes Available
Supports Informed Choices
Changes Propagate Throughout
User Can Specify High-Level Preferences
Constraint Networks for Managing Information
Constraint reasoning system
  Propagates information
  Decides when to launch information requests
  Evaluates constraints
  Computes preferences
  All run as asynchronous processes to support the user
Components:
  Representation of the variables
  Representation of constraints
  Hierarchical templates
  Constraint propagation
Constraint Networks for Integrating Information
Components:
  Representation of the variables
  Representation of constraints
  Hierarchical template representation
  Constraint propagation and cycle detection
Constraint Variables
A constraint network consists of a set of variables such as:
  MeetingStartTime
  MeetingLocation
Variables are related by constraints that determine the possible values of a solution
Constraint Representation
Constraints are computable components:
  Local calculations (e.g., XQuery)
    MeetingStartTime + MeetingDuration --> MeetingEndTime
  Web and Database Wrappers
    ITN: DepartureAirport, ArrivalAirport, Date --> Flights
    Yahoo Weather: City, Date --> Weather prediction
  External Programs (Outlook, Planners, etc.)
    Outlook Calendar: Date --> Meetings
Results cached in tables
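A constraint as a computable component with cached results might look like the sketch below. The class name `Constraint`, its fields, and the use of an in-memory dictionary as the cache table are assumptions for illustration, not the Heracles representation.

```python
from datetime import datetime, timedelta


class Constraint:
    """A computable component: input variables --> output variable, with a result table."""

    def __init__(self, inputs, output, compute):
        self.inputs = inputs
        self.output = output
        self.compute = compute
        self.table = {}   # cache: input values -> computed result
        self.calls = 0    # how often the underlying computation actually ran

    def evaluate(self, values):
        key = tuple(values[name] for name in self.inputs)
        if key not in self.table:
            self.calls += 1
            self.table[key] = self.compute(*key)
        return self.table[key]


# Local calculation: MeetingStartTime + MeetingDuration --> MeetingEndTime
end_time = Constraint(
    inputs=("MeetingStartTime", "MeetingDuration"),
    output="MeetingEndTime",
    compute=lambda start, duration: start + duration,
)

values = {
    "MeetingStartTime": datetime(2000, 9, 30, 9, 0),
    "MeetingDuration": timedelta(hours=2),
}
print(end_time.evaluate(values))  # 2000-09-30 11:00:00
print(end_time.evaluate(values))  # same result, served from the table
```

Caching matters most for the wrapper-backed constraints, where each evaluation would otherwise mean another request to a web site or database.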
Drive or Take a Taxi?
[Figure: constraint network for choosing the mode of transport to the airport.
 Variables (with the values shown): DepartureDate (Sep 30, 2000), ReturnDate (Oct 2, 2000), Duration (3 days), DepartureAirport (LAX), ParkingRate ($7.00/day), ParkingTotal ($21.00), TaxiFare ($23.00), ModeToAirport (Drive), OriginAddress, DestinationAddress, Distance (15.1 miles).
 Constraints: computeDuration, getParkingRate, multiply, GetTaxiFare, GetDistance, FindClosestAirport, SelectModeToAirport.]
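The arithmetic in the Drive or Take a Taxi network can be traced with a short sketch. Several points are assumptions: the duration counts both the departure and return days (which gives the 3 days shown), the parking total is rate times days, and the mode is selected by simply comparing the two costs. The function names only mirror the constraint labels in the figure.

```python
from datetime import date


def compute_duration(departure, return_date):
    # Assumption: both endpoints count, so Sep 30 to Oct 2 is 3 days.
    return (return_date - departure).days + 1


def multiply(rate_per_day, days):
    return rate_per_day * days


def select_mode_to_airport(parking_total, taxi_fare):
    # Assumption: pick the cheaper option.
    return "Drive" if parking_total <= taxi_fare else "Taxi"


duration = compute_duration(date(2000, 9, 30), date(2000, 10, 2))
parking_total = multiply(7.00, duration)   # ParkingRate is $7.00/day at LAX
mode = select_mode_to_airport(parking_total, taxi_fare=23.00)

print(duration, parking_total, mode)  # 3 21.0 Drive
```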
Hierarchically-Partitioned Constraint Networks
Template:
  Groups related variables and constraints
  Organizes information for computation and presentation to user
Templates organized hierarchically
  Template decomposed into subtemplates
  Choose among alternative subtemplates
Template Structure
Template
  Arguments: input and output variables
  Variables: name, type, default values
  Constraints
  Expansions: alternative subtemplate calls
  GUI specification
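The template structure listed above could be modeled as a plain record type. This is only a sketch: the field names and the example values are hypothetical and do not reflect the actual Heracles template syntax.

```python
from dataclasses import dataclass, field


@dataclass
class Variable:
    name: str
    type: str
    default: object = None


@dataclass
class Template:
    name: str
    inputs: list = field(default_factory=list)       # argument variables passed in
    outputs: list = field(default_factory=list)      # argument variables passed out
    variables: list = field(default_factory=list)    # Variable records
    constraints: list = field(default_factory=list)
    expansions: list = field(default_factory=list)   # alternative subtemplate calls
    gui: dict = field(default_factory=dict)          # GUI specification


# Hypothetical instance for illustration.
mode_to_airport = Template(
    name="ModeToAirport",
    inputs=["OriginAddress", "DepartureAirport"],
    outputs=["ModeToAirport"],
    variables=[Variable("TaxiFare", "money"), Variable("ParkingRate", "money")],
    expansions=["Drive", "Taxi"],
)
print(mode_to_airport.expansions)  # ['Drive', 'Taxi']
```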
Partitioned Constraint Network
[Figure: variables grouped into partitions: Who, Company, Subject, Starting Time, Ending Time, Origin Addr., Dest. Addr., Origin Weather, Dest Weather, Distance, Travel Mode, Depart Time, Depart Airport, Arrival Airport, Flight Num, Arrival Time, Parking Lot, Parking Rate, Mode to Airport, Dist. to Airport, Taxi Fare]
Template Hierarchy for the Travel Assistant
[Figure: template hierarchy with nodes Trip, ModeNext, Drive, ModeToDestination, Fly, ModeToAirport, Taxi, FlightDetail, Hotel, ModeHotel, NoOvernight, Trip(Return Home), Trip(Return Office), Trip(New Leg), ModeFromAirport, End. A second panel shows Trip decomposed through AND and OR nodes into alternatives such as Drive and Taxi.]
Dynamic Networks
Generalization of Constraint Networks
Variables can be active or inactive
Normal constraints: x1 = k1 ^ … ^ xm = km --> xn = kn
Activity constraints: x1 = k1 ^ … ^ xm = km --> active(xn)
Inactive variables do not participate in the network, i.e., do not propagate constraints
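Activity constraints can be sketched as rules that switch variables on when their conditions hold. The representation below (a condition dictionary paired with the variable to activate) and the example variables are assumptions chosen for illustration.

```python
def active_variables(core, assignment, activity_constraints):
    """Return the set of active variables given the current assignment.

    core: variables that are always active.
    activity_constraints: list of (condition, target), read as
        x1 = k1 ^ ... ^ xm = km --> active(target).
    """
    active = set(core)
    changed = True
    while changed:  # repeat, since activating one variable may enable another rule
        changed = False
        for condition, target in activity_constraints:
            fires = all(
                var in active and assignment.get(var) == value
                for var, value in condition.items()
            )
            if fires and target not in active:
                active.add(target)
                changed = True
    return active


ACTIVITY_CONSTRAINTS = [
    ({"ModeToAirport": "Taxi"}, "TaxiFare"),
    ({"ModeToAirport": "Drive"}, "ParkingRate"),
    ({"ModeToAirport": "Drive"}, "ParkingTotal"),
]

active = active_variables({"ModeToAirport"}, {"ModeToAirport": "Drive"}, ACTIVITY_CONSTRAINTS)
print(sorted(active))  # ['ModeToAirport', 'ParkingRate', 'ParkingTotal']
```

Only active variables would then take part in propagation; TaxiFare stays out of the network until the mode changes to Taxi.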
Heracles: Template Selection
Core network
  Computes values of template selection vars
  Always active
Template selection variables
  Inputs to activity constraints: determine the choice of subtemplates, i.e., which additional variables are active
Constraint Propagation
Approach
  When a variable is assigned a value, re-compute the value sets and assigned values of all dependent variables
  Proceeds recursively until no values are changed or a cycle is detected
Core network
  Propagates all variables through the core network
  Remaining variables are computed when a template is opened
Does not perform full CSP
  Less costly
  Does not require all information in advance
  Makes choices locally, so may fail to find optimal assignment
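The propagation approach described above can be sketched as a recursive update that stops when a value does not change and raises an error when a variable is revisited on the current path. This simplified version tracks only assigned values, not value sets, and the tuple representation of constraints is an assumption.

```python
def assign(values, constraints, variable, value):
    """Assign a value, then recursively re-compute all dependent variables.

    constraints: list of (inputs, output, function) tuples.
    """
    values[variable] = value
    _propagate(values, constraints, variable, [variable])
    return values


def _propagate(values, constraints, variable, path):
    for inputs, output, function in constraints:
        if variable not in inputs:
            continue
        if any(name not in values for name in inputs):
            continue  # not enough information yet
        new_value = function(*(values[name] for name in inputs))
        if values.get(output) == new_value:
            continue  # no change, so propagation stops here
        if output in path:
            raise ValueError("cycle detected at " + output)
        values[output] = new_value
        _propagate(values, constraints, output, path + [output])


CONSTRAINTS = [
    (("Duration", "ParkingRate"), "ParkingTotal", lambda days, rate: days * rate),
    (("ParkingTotal", "TaxiFare"), "ModeToAirport",
     lambda parking, taxi: "Drive" if parking <= taxi else "Taxi"),
]

values = {"ParkingRate": 7.0, "TaxiFare": 23.0}
assign(values, CONSTRAINTS, "Duration", 3)
print(values["ParkingTotal"], values["ModeToAirport"])  # 21.0 Drive
assign(values, CONSTRAINTS, "Duration", 4)
print(values["ParkingTotal"], values["ModeToAirport"])  # 28.0 Taxi
```

Each choice is made locally as values arrive, which is cheaper than solving a full CSP but does not guarantee an optimal overall assignment.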
Discussion
General framework for interleaving planning and information gathering
Retrieves information as needed
Gathers and integrates data in a uniform framework
Evaluates tradeoffs and selects among alternatives
Allows the users to explore alternatives
Supports a wide variety of information types: databases, web pages, images, video, etc.
SmartClients [Torrens et al, 2002]
Cast an integration problem as a Constraint Satisfaction Problem (CSP)
Given a request, the server retrieves the required data and sends the data and the CSP to the client
Client solves the CSP locally
Large complex problem transmitted in small amount of space
Provides fine-grained user interaction with the data
Architecture for SmartClients
SmartClients: Pros and Cons
Pros
  Elegant approach that exploits past work on CSPs
  Minimizes the data retrieval and supports complex reasoning and integration of the data
Cons
  Assumes that all data can be retrieved before any reasoning about the data
  In the travel planning, assumes that prices are the same on any date and there are no issues with flight availability
Summary
Our approach for creating “web assistants”:
  Agents for accessing web data
  Record linkage for mapping between sources
  Constraint-based integration provides the glue