Interactively Building Geospatial Mashups
Craig A. Knoblock University of Southern California
Work in collaboration with Shubham Gupta, Pedro Szekely, and Rattapoom Tuchinda
MASHUPS
• A website or application that combines content from more than one source into an integrated experience [wikipedia]
a) LA crime map b) zillow.com c) Ski bonk
Combined Data gives new insight / provides new services
- Crime Report from different counties - Map
- Real Estate Listing -Property Tax
- Weather -Snow Report -Snow Resorts
Introduction • Approach • Evaluation • Related Work • Conclusion
PROBLEM
• Most Mashups require significant expertise to create
• Demand for creating integrated applications is huge
• Every user has their own unique requirements for an integrated application
• The number of available sources and the need to integrate data continue to grow
MASHUP BUILDING ISSUES
• Data Retrieval: Wrapper (one per source)
• Calibration (source modeling, cleaning): Clean, Attribute (one per source)
• Integration: Combine
• Display: Customize Display
EXISTING APPROACHES
Goal: Create Mashups without Programming
• This does not translate to not having to understand programming
Yahoo’s Pipes
Widget Paradigm
‐ Widgets (43 for Pipes, 300+ for MS) each represent an operation on the data
‐ Locating and learning to customize widgets can be time consuming
‐ Most tools focus on particular issues and ignore others
Can we come up with a framework that addresses all of the issues while still making the Mashup building process easy?
KEY CONTRIBUTIONS
• A programming by demonstration approach that uses a single table for building a Mashup
• An integrated approach that links data extraction, source modeling, data cleaning, and data integration together
• A query formulation technique that allows users to specify examples to build complicated queries
KEY IDEAS
• Focus on data, not operations
– Users are more familiar with data
• Leverage existing data
– Helps source modeling, cleaning, and data integration
• Consolidate as opposed to Divide‐And‐Conquer
– Solving a problem in one issue can help solve another issue
– Interacting within a single spreadsheet platform
KARMA USER INTERFACE
Data Source Types Currently Supported by Karma
Various Information Integration Operations
Data Table – Spreadsheet Type Interface
INTEGRATION SCENARIO
• Evacuation Centers CSV → Extract {EvacCenter_ID, Address, City}
• Emergency Coordinator MySQL Database → Extract {Name, City, Phone No.} → Clean
• Combined table {EvacCenter_ID, Address, City, Name, Phone No.} → MAP
• Injury statistics in Excel Spreadsheet → Extract {Date, Injuries, Fatalities} → Visualize as chart
• Google News Website → Extract {Headlines, Summary, Date, Link} → Visualize as bulleted list
RETRIEVING DATA FROM DIVERSE SOURCES
• Karma facilitates retrieval of data from structured data‐sources, such as Excel spreadsheets, MySQL databases and CSV files
• Karma also facilitates the extraction of data from semi‐structured data sources such as web pages
CSV Text File MySQL Database
Excel Spreadsheet HTML Web Page
EXTRACTION BY EXAMPLE
• The retrieval of data from structured data‐sources, such as Excel sheets and CSV files, is done through a drag and drop mechanism
• The user is only required to select a sample data‐element and drop it into Karma’s data table
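One way to picture the drag-and-drop step for a CSV source is as a lookup: find the column that contains the dropped sample value and import that whole column. The block below is a minimal sketch under that assumption, not Karma's actual implementation; the function name `column_for_example` and the sample rows are hypothetical.

```python
import csv
import io

def column_for_example(csv_text, example):
    """Return (header, values) for the first column containing `example`, else None."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    for idx, name in enumerate(header):
        column = [row[idx] for row in body]
        if example in column:
            return name, column
    return None

# Hypothetical sample data; only the column names come from the slides.
SAMPLE_CSV = (
    "EvacCenter_ID,Address,City\n"
    "1,12 Main St,Pasadena\n"
    "2,40 Oak Ave,Burbank\n"
)

# Dropping the single value "Burbank" pulls in the whole City column.
print(column_for_example(SAMPLE_CSV, "Burbank"))
```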
EXTRACTION FROM THE WEB
• Path to the selected example: Tbody/tr[1]/td[2]/a
[Figure: DOM tree of the page (TBODY → tr → td → a, br, br) listing "Japon Bistro" (970 E Colora.., Upscale yet affordabl..) and "Hokusai" (8400 Wilshir., Chic elegance…)]
• Generalized path: Tbody/tr*/td*/a
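The step from Tbody/tr[1]/td[2]/a to Tbody/tr*/td*/a can be sketched as dropping the positional indices from the example's path, so that the same path matches the corresponding element in every row. This is an illustration only: the `generalize` helper and the small page fragment are hypothetical, and the standard library's limited XPath support stands in for whatever Karma uses internally.

```python
import re
import xml.etree.ElementTree as ET

def generalize(path):
    """Drop positional predicates: 'tr[1]/td[2]/a' -> 'tr/td/a'."""
    return re.sub(r"\[\d+\]", "", path)

# Hypothetical, well-formed stand-in for the restaurant listing page.
PAGE = """
<tbody>
  <tr><td>1.</td><td><a>Japon Bistro</a></td></tr>
  <tr><td>2.</td><td><a>Hokusai</a></td></tr>
</tbody>
"""

root = ET.fromstring(PAGE)        # root is the <tbody> element
example_path = "tr[1]/td[2]/a"    # path to the element the user selected
general_path = generalize(example_path)

selected = [a.text for a in root.findall(example_path)]
all_names = [a.text for a in root.findall(general_path)]
print(selected)    # the single example
print(all_names)   # every element reached by the generalized path
```

Removing an index has the same effect here as the `*` wildcard on the slide: the step matches every sibling of that tag rather than one position.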
EXPLOITING WRAPPER LIBRARIES
Wrapper Library: Karma lists all the available wrappers on the local machine.
SOURCE MODELING
• Karma automatically generates the semantic type of each attribute to learn the underlying model of the data source
• Supervised machine learning techniques are used to generate a set of patterns for each semantic type from training data
Initial Type: Manually label the data with the correct semantic type to train Karma
When new data of the same type is imported, Karma automatically labels it correctly
LEARNING SEMANTIC TYPES
:StreetAddress: 4DIG CAPS Rd; 3DIG N CAPS Ave; …
:Email: [email protected]; [email protected]; …
:State: CA; 2UPPER; …
:Telephone: (3DIG) 3DIG-4DIG; +1 3DIG 2DIG 4DIG; …
[Figure: background knowledge is used to learn patterns, which are then used to label new data]
Idea: Learn a model of the content of data and use it to recognize new examples
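The idea can be sketched by abstracting each value into a token pattern (in the spirit of 4DIG, CAPS, 2UPPER) and labeling new values whose pattern was seen for a type during training. This is a deliberately simplified stand-in: it uses exact pattern lookup rather than the supervised learning the slides mention, abstracts every token instead of keeping literals such as "Rd", and the training values are hypothetical.

```python
import re
from collections import defaultdict

def token_class(token):
    """Map a token to a coarse class name."""
    if token.isdigit():
        return f"{len(token)}DIG"
    if token.isalpha() and token.isupper():
        return f"{len(token)}UPPER"
    if token.isalpha() and token[0].isupper():
        return "CAPS"
    return token  # punctuation and other tokens are kept literally

def pattern(value):
    tokens = re.findall(r"[A-Za-z]+|\d+|\S", value)
    return " ".join(token_class(t) for t in tokens)

def learn(labeled):
    """Collect the set of patterns observed for each semantic type."""
    patterns = defaultdict(set)
    for semantic_type, value in labeled:
        patterns[semantic_type].add(pattern(value))
    return patterns

def label(value, patterns):
    """Return the semantic types whose learned patterns match the value."""
    p = pattern(value)
    return sorted(t for t, seen in patterns.items() if p in seen)

# Hypothetical training examples.
TRAINING = [
    ("StreetAddress", "4676 Admiralty Way"),
    ("State", "CA"),
    ("Telephone", "(310) 822-1511"),
]
model = learn(TRAINING)
print(pattern("4676 Admiralty Way"))
print(label("1200 Getty Center", model))
print(label("NY", model))
```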
DATA CLEANING
• Karma performs data cleaning by learning transformation rules from examples and applying them
[Figure: Initial data source → User provides example → Karma learns a transformation rule and applies it to the remaining data → Data source after cleaning]
DATA CLEANING: PREDEFINED TRANSFORMATIONS
Predefined Rules
31 Reviews → 31
Subset Rule: $(s_1 s_2 \ldots s_k) \rightarrow (d_1 d_2 \ldots d_t) \wedge (k \le t) \wedge s_i \in \{d_1, d_2, \ldots, d_t\} \wedge d_i \ne d_j$
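A small sketch of how a subset-style rule might be learned from one example and applied to the remaining rows. Two assumptions are made here and are not confirmed by the slides: s is read as the token sequence of the cleaned value and d as that of the raw value (the reading under which 31 Reviews → 31 satisfies k ≤ t), and the rule is generalized by token position. The function names are hypothetical.

```python
def is_subset(s_tokens, d_tokens):
    """Literal check of the subset condition: k <= t, each s_i among the d's, d's distinct."""
    return (
        len(s_tokens) <= len(d_tokens)
        and all(s in d_tokens for s in s_tokens)
        and len(set(d_tokens)) == len(d_tokens)
    )

def learn_positions(raw, cleaned):
    """From one user example, record which raw token positions are kept (assumed generalization)."""
    raw_tokens, kept = raw.split(), cleaned.split()
    if not is_subset(kept, raw_tokens):
        return None
    return [raw_tokens.index(tok) for tok in kept]

def apply_positions(positions, raw):
    """Apply the learned positions to another raw value."""
    tokens = raw.split()
    return " ".join(tokens[i] for i in positions if i < len(tokens))

# The user cleans one cell; the rule is then applied to the other cells.
rule = learn_positions("31 Reviews", "31")
cleaned = [apply_positions(rule, v) for v in ["12 Reviews", "205 Reviews"]]
print(rule, cleaned)
```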
DATA INTEGRATION
• Karma discovers the related sources by detecting and ranking associations based on the common attribute names and matching semantic types
• Karma suggests potential joins between the current data sources in the form of column completions
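Ranking related sources by common attribute names and matching semantic types can be sketched as a simple overlap score. The equal weighting of the two signals, the function name `rank_sources`, and all semantic type names other than PR-Address are assumptions made for illustration.

```python
def rank_sources(current, candidates):
    """Score candidates by shared attribute names plus shared semantic types.

    `current` and each candidate map attribute name -> semantic type.
    Returns (score, source, shared attribute names), best first.
    """
    cur_names = set(current)
    cur_types = set(current.values())
    scored = []
    for source, attrs in candidates.items():
        shared_names = cur_names & set(attrs)
        shared_types = cur_types & set(attrs.values())
        score = len(shared_names) + len(shared_types)
        if score:
            scored.append((score, source, sorted(shared_names)))
    return sorted(scored, key=lambda item: (-item[0], item[1]))

# Hypothetical semantic type labels for the scenario's sources.
current = {"EvacCenter_ID": "ID", "Address": "PR-Address", "City": "PR-City"}
candidates = {
    "EmergencyCoordinator": {"Name": "PR-Name", "City": "PR-City", "Phone No.": "PR-Phone"},
    "InjuryStatistics": {"Date": "Date", "Injuries": "Number", "Fatalities": "Number"},
}
print(rank_sources(current, candidates))
```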
USER SELECTS FROM COLUMN COMPLETIONS
MySQL Database loaded as another source in Karma
Karma suggests the possible column completions in a drop down list
Karma executes the join query once the user selects an option
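The suggest-then-join interaction can be sketched with plain dictionaries: list the columns a second source could add, then fill the chosen column by joining on a shared key. The left-join behavior (unmatched rows get `None`), the helper names, and the sample rows are assumptions for this sketch.

```python
def column_completions(table, source, key):
    """Columns the source could add to the table via a join on `key`."""
    return [col for col in source[0] if col != key and col not in table[0]]

def join_column(table, source, key, column):
    """Add `column` from `source` to each table row, matching on `key`."""
    lookup = {row[key]: row[column] for row in source}
    return [{**row, column: lookup.get(row[key])} for row in table]

# Hypothetical rows; only the column names come from the slides.
centers = [
    {"EvacCenter_ID": 1, "City": "Pasadena"},
    {"EvacCenter_ID": 2, "City": "Burbank"},
]
coordinators = [{"City": "Pasadena", "Name": "A. Lee", "Phone No.": "555-0100"}]

options = column_completions(centers, coordinators, "City")
joined = join_column(centers, coordinators, "City", options[0])
print(options)
print(joined)
```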
DATA VISUALIZATION
• Visualization by demonstration approach
– The user demonstrates to Karma the kind of visualization desired for the data specified through examples using a drag and drop mechanism
DATA VISUALIZATION
Karma currently supports four types of visualization formats:
1. Chart Format: Useful for visualizing numerical statistics, time-based events, etc.
2. Paragraph Format: Useful for visualizing descriptive text data such as Wikipedia definitions
3. List Format: Useful for visualizing information in a bulleted list, such as a list of summarized news articles
4. Table Format: Useful for visualizing information that is best presented in a row‐and‐column format, such as numerical values
RESULTS CAN BE PUBLISHED IN MULTIPLE FORMATS
• Karma lets you export your final mashup in a variety of formats:
‐ HTML Page
‐ Database table
‐ KML Layer
‐ XML File
‐ CSV Text File
Different mashup publishing options
AUTOMATICALLY FINDS GEOSPATIAL REFERENCES
• Final mashup output in HTML web page format:
‐ Karma identifies geospatial information in the current data with the help of geographic semantic types such as PR‐Address, PR‐Latitude, etc.
‐ The Google geocoding service is used to find the coordinates for a given address
‐ Karma uses the coordinate information to place the markers in the final mashup
[Figure labels: Potential geographic information; Options to publish mashup as HTML web page]
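The three steps above can be sketched as: pick out columns whose semantic type is geographic, geocode each address, and emit one marker per resolved row. The geocoder here is a local stub passed in as a function, so no real geocoding API is called or modeled; `GEO_TYPES` lists only the two types named on the slide, and the sample address and coordinates are hypothetical.

```python
# Only the two geographic types named on the slide; the full set is not given.
GEO_TYPES = {"PR-Address", "PR-Latitude"}

def geo_columns(semantic_types):
    """Columns whose semantic type marks them as geographic."""
    return [col for col, t in semantic_types.items() if t in GEO_TYPES]

def markers(rows, address_col, geocode):
    """Build one marker per row whose address the geocoder can resolve."""
    out = []
    for row in rows:
        coords = geocode(row[address_col])
        if coords is not None:
            out.append({"label": row[address_col], "lat": coords[0], "lng": coords[1]})
    return out

def fake_geocode(address):
    """Stub standing in for a geocoding service (hypothetical data)."""
    table = {"12 Main St, Pasadena": (34.1478, -118.1445)}
    return table.get(address)

types = {"EvacCenter_ID": "ID", "Address": "PR-Address"}
rows = [{"Address": "12 Main St, Pasadena"}, {"Address": "unknown place"}]
cols = geo_columns(types)
points = markers(rows, cols[0], fake_geocode)
print(cols, points)
```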
CONSTRUCTS A MAP WITH USER‐DEFINED LAYOUT
• Final mashup as an HTML web page:
RESULTS CAN BE EXPORTED AS KML
• Final mashup output as a KML layer
Options to publish mashup as KML layer
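For readers unfamiliar with KML, a layer of point markers is a small XML document with one Placemark per point. The following sketch builds one with the standard library; the `to_kml` function and the sample point are hypothetical, and Karma's actual export may include styling or other elements not shown here.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"

def to_kml(points):
    """Serialize points (dicts with name, lat, lng) as a minimal KML document."""
    ET.register_namespace("", KML_NS)
    kml = ET.Element(f"{{{KML_NS}}}kml")
    doc = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    for p in points:
        placemark = ET.SubElement(doc, f"{{{KML_NS}}}Placemark")
        ET.SubElement(placemark, f"{{{KML_NS}}}name").text = p["name"]
        point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
        # KML orders coordinates as longitude,latitude.
        coords = ET.SubElement(point, f"{{{KML_NS}}}coordinates")
        coords.text = f'{p["lng"]},{p["lat"]}'
    return ET.tostring(kml, encoding="unicode")

# Hypothetical marker.
kml_text = to_kml([{"name": "Evacuation Center 1", "lat": 34.1478, "lng": -118.1445}])
print(kml_text)
```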
KML LAYERS CAN BE OPENED IN GOOGLE EARTH
The generated KML layer can be viewed in GIS software such as Google Earth
RESULTS CAN BE STORED IN A DB
• The final mashup data can also be saved into a database table by providing the details about the database location, username and password, etc. in Karma
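Saving the final table amounts to creating a table from the column names and inserting one row per record. The sketch below swaps in an in-memory SQLite database so that it runs without a server; a server database would additionally need the location, username, and password mentioned above. The `save_table` helper and the sample rows are hypothetical.

```python
import sqlite3

def save_table(conn, table_name, rows):
    """Create `table_name` from the keys of the first row and insert all rows.

    Table and column names are interpolated directly, so they must be trusted.
    """
    columns = list(rows[0])
    col_sql = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f'CREATE TABLE "{table_name}" ({col_sql})')
    conn.executemany(
        f'INSERT INTO "{table_name}" VALUES ({placeholders})',
        [tuple(row[c] for c in columns) for row in rows],
    )
    conn.commit()

# Hypothetical mashup rows.
mashup = [
    {"EvacCenter_ID": 1, "City": "Pasadena", "Name": "A. Lee"},
    {"EvacCenter_ID": 2, "City": "Burbank", "Name": None},
]
conn = sqlite3.connect(":memory:")
save_table(conn, "mashup", mashup)
stored = conn.execute('SELECT "City" FROM "mashup" ORDER BY "EvacCenter_ID"').fetchall()
print(stored)
```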
EVALUATION
• Baseline: A combination of Dapper/Pipes
• Claims:
1. Users with no programming experience can build all four Mashup types.
2. Karma takes less time to complete each subtask and scales better as the tasks get harder
3. Overall, the user takes less time to build the same Mashup in Karma compared to Dapper/Pipes
• Users:
– Programmers (20)
– Non‐programmers (3)
EVALUATION: SETUP
Familiarization
‐ Programmers (2 assignments on DP, i.e., Dapper/Pipes)
‐ Review package
‐ 30-minute tutorial
Practice
‐ 2‐3 tasks using Karma
Test (3 tasks)
‐ Programmers: alternating between Karma and DP for each task
‐ Non-programmers: use only Karma
‐ Screens are recorded using video capture software
‐ 5-minute cut-off time
EVALUATION: TASKS
• Claim 1: Users with no programming experience can build all four Mashup types
• Claim 2: When the Mashup subtask is difficult, Karma takes less time to complete that subtask
• Claim 3: Overall, the user takes less time to build the same Mashup in Karma compared to Dapper/Pipes
Task No.  Mashup Type         Data Extraction  Source Modeling  Data Cleaning  Data Integration
1         1 (1 source)        Moderate         Simple           Difficult      N/A
2         2,3 (union+form)    Difficult        Simple           Simple         Union (simple)
3         4 (join 2 sources)  Simple           Simple           N/A            Join (difficult)
Claim 1: Users with no programming experience can build all four Mashup types
EVALUATION: NON‐PROGRAMMERS
Claim 2: Karma takes less time to complete each subtask
EVALUATION: EXTRACTION
• As the extraction task gets more difficult, with Dapper/Pipes:
- subjects take longer
- more subjects fail to complete the task (11% for moderate and 25% for difficult)
[Chart legend: Dapper/Pipes, Karma (programmer)]
EVALUATION: SOURCE MODELING
• Karma performed worse in task 1 and task 2
- only a 30 sec difference
- subjects take time selecting attributes
- the saving will be realized in the data integration step
• Karma performed better in task 3 because it can automatically identify the attribute
[Chart legend: Dapper/Pipes, Karma]
EVALUATION: DATA CLEANING
• Karma performed better in both tasks
• When the cleaning task gets harder, more subjects fail in Dapper/Pipes (35% for simple and 83% for hard)
[Chart legend: Dapper/Pipes, Karma]
EVALUATION: DATA INTEGRATION
• Because of the table structure, subjects can specify a union indirectly by dropping data into the right cell
• The time spent in the source modeling step allows Karma to suggest the linking source
• Dapper/Pipes: 30% fail in the union case and 95% fail in the join case
[Chart legend: Dapper/Pipes, Karma]
Claim 3: Overall, the user takes less time to build the same Mashup in Karma compared to Dapper/Pipes
EVALUATION: OVERALL
[Chart legend: Dapper/Pipes, Karma]
EVALUATION: AVERAGE
[Chart: average results, Dapper/Pipes vs. Karma; labels shown: 2.22x, 0.67x, 4.16x, 6.49x, 3.32x]
RELATED WORK: MASHUP BUILDING TOOLS
1: Extraction, 2: Union, 3: Form‐based Interaction, 4: Join
Require an expert
Mainly focus on extraction / linear
Q/A approach / linear / scalability
Widgets Fancier UI/ more widgets
Fewer Widgets / Confusion on workflow
Early work. Focus on DOM, too basic
Create points on Map
RDF / Manually specify data int
Tuple = card. Drawing links for rela(ons
RELATED WORK: DATA EXTRACTION
• Automatic extraction: tables and lists only
– RoadRunner (exploits HTML structure) [Crescenzi et al., 2001]
– Adel (grammar induction to detect rows) [Lerman+ 2001]
– VisualWeb (OCR technique to detect tables) [Gatterbauer+ 2007]
• Semi‐automatic: requires more labeled examples
– WIEN (inductive, less expressive than Stalker) [Kushmerick 1997]
– Stalker (co-testing) [Muslea+ 1999]
– SoftMealy (finite state transducer) [Hsu 1998]
– WHISK (rigid format, exact delimiter) [Soderland 1998]
• DOM: relies on well‐formed HTML and less labeling
– Simile [Huynh+ 2005]
– Dapper
– Interactive Wrapper Generation (ML + prediction on DOM) [Irmak+ 2006]
– PLOW (adds natural language) [Allen+ 2007]
– Cards [Dontcheva+ 2007]
– Karma [Tuchinda+ 2008]
RELATED WORK: SOURCE MODELING
• 1:1 mapping, N:M mapping
– Schema‐level match
  • TranScm [Milo+ 98]
  • DIKE [Palopoli+ 99]
  • Artemis [Castano+ 01]
  • Delta [Clifton+ 97]
– +Instance‐based matcher
  • SemInt [Li 00]
  • LSD [Doan 01]
  • ILA [Etzioni 95]
  • iMAP [Dhamankar 04]
  • Clio (interactive) [Ling 01]
  • Inducing Source Description [Carman 07]
• Karma leverages existing techniques to narrow candidate matches
– String Similarities [Cohen+ 2003]
RELATED WORK: DATA CLEANING
• Commercial Tools: Focus on writing transformations
– ACR/Data, Migration Architect [Chaudhuri+ 1997]
• Discrepancy Detection: Used as a stepping stone for record linkage and cleaning systems
– Levenshtein distance [Needleman+ 70]
– Vector based [Baeza‐Yates+ 99]
– EM [Ristad+ 98]
– SVM [Bilenko+ 03]
• Record linkage & cleaning systems: Focus on ranking [Winkler 06]
– Fuzzy Match [Chaudhuri+ 03]
– Apollo [Michalowski+ 05]
– Phoebus [Michelson+ 07]
– Potter’s Wheel [Raman+ 01]
• Karma
– Gains reference sources through the source modeling process
– Provides predefined transformations
RELATED WORK: DATA INTEGRATION
• Universal Relation: Makes it easier to formulate the query, but users still need to formulate the query [Ullman 1980, 1988]
• Query by example: Need to know which data sources to use, and the query may not return results
– QBE [Zloof 1975]
• Retrieval by formulation: Need to understand the domain model to formulate a partial description
– Helgon [Fischer 1989], RABBIT [Williams 1982]
• Graphical Query Language: Users still need to navigate through sources (graphs)
– Gql [Benzi 1998, Haw 1994, Papantonakis 1988]
• Question‐Answering Techniques: Understanding of database operations required
– Agent Wizard [Tuchinda+ 2004]
• Interactive Schema/data integration: Understanding of the source schema required
– Clio [Ling 01]
• Karma is based on Programming by Demonstration [Cypher 2001; Lau 2001]
CONCLUSION
• Mashups are a fast-growing area
– Need an efficient way for casual web users to build them
• Contributions
– A PBD approach that uses a single table for building a Mashup
– An integrated approach that solves the various Mashup building issues
– A query formulation technique that allows users to specify examples to build complicated queries
• Evaluated the validity of the Karma approach
– Subjects were able to complete Mashup building tasks in Karma
– The overall improvement is at least a factor of 3.5
FUTURE WORK
• Learn and generalize over the task
– Store the integration plan so that it can be reexecuted on current data
• Support the integration of geospatial data types (i.e., vector layers, raster layers)
• Improve the techniques for automatic source modeling
• Learn new transformations from examples for data cleaning
PAPERS
• Building geospatial mashups to visualize information for crisis management. Shubham Gupta and Craig A. Knoblock. In Proceedings of the 7th International Conference on Information Systems for Crisis Response and Management, 2010.
• Interactive data integration through smart copy & paste. Zachary G. Ives, Craig A. Knoblock, Steven Minton, Marie Jacob, Partha Pratim Talukdar, Rattapoom Tuchinda, Jose Luis Ambite, Maria Muslea, and Cenk Gazen. In Fourth Biennial Conference on Innovative Data Systems Research (CIDR), 2009.
• Building mashups by example. Rattapoom Tuchinda, Pedro Szekely, and Craig A. Knoblock. In Proceedings of the 2008 International Conference on Intelligent User Interfaces, 2008.
• Building data integration queries by demonstration. Rattapoom Tuchinda, Pedro Szekely, and Craig A. Knoblock. In Proceedings of the International Conference on Intelligent User Interfaces, 2007.