On Leveraging Crowdsourcing Techniques for Schema Matching Networks


1

Nguyen Quoc Viet Hung, Nguyen Thanh Tam, Karl Aberer

École Polytechnique Fédérale de Lausanne, Switzerland

Zoltán Miklós

Université de Rennes 1, IRISA, France

DASFAA 2013, Part II, LNCS 7826, pp. 139–154, 2013

2

Database schema matching is an active research field:
- Surveys: [1], [2]
- Applications: data transformation, data migration, data alignment, …
- Automatic matching tools: COMA++, AMC, OpenII, Falcon, …

Schema matching is the task of establishing correspondences that connect related attributes in two (independently developed) database schemas.

[1] Rahm, E. et al. "A Survey of Approaches to Automatic Schema Matching". VLDB Journal, 2001.
[2] Bernstein, P.A. et al. "Generic Schema Matching, Ten Years Later". PVLDB, 2011.

[Figure: two schemas SA and SB; correspondences link SA.BirthName to SB.BirthName and SA.Address to SB.Address, while SA.BirthDate has no counterpart in SB.]
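The mismatch in this example can be reproduced with a toy name-similarity matcher (a sketch only, not one of the cited tools such as COMA++; the schema contents and the 0.8 threshold are illustrative assumptions):

```python
from difflib import SequenceMatcher

# Toy schema matcher: propose a correspondence whenever two attribute
# names are sufficiently similar. Real tools (COMA++, AMC, ...) combine
# many more signals; the 0.8 threshold here is an illustrative choice.
SA = ["BirthName", "BirthDate", "Address"]
SB = ["BirthName", "Address"]

def match(source, target, threshold=0.8):
    """Return name-based candidate correspondences between two schemas."""
    pairs = []
    for a in source:
        for b in target:
            sim = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if sim >= threshold:
                pairs.append((a, b))
    return pairs

print(match(SA, SB))  # BirthDate finds no counterpart in SB
```

Note that BirthDate vs. BirthName scores about 0.78, just below the threshold: a slightly lower threshold would produce an incorrect correspondence, which is exactly why human validation is needed.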

3

Automatic schema matchers will (sometimes) fail to identify the correct correspondences.

There is a need for post-matching reconciliation through human input; this effort is the "real cost" for companies.

Schemas do not appear alone; they are part of a matching network.

The network-level consistency constraints are very important for business users.

4

Real‐world scenario: a repository of schemas in the same domain

Schema matching network: connect schemas by pair‐wise matchings

Network‐level consistency constraints 

Automatic tools produce incorrect correspondences → need for validation by humans.


DASFAA 2013, BDA 2013: On Leveraging Crowdsourcing Techniques for Schema Matching Networks
ER 2013: Minimizing Human Effort in Reconciling Match Networks
CoopIS 2013: Collaborative Schema Matching Reconciliation
ICDE 2014: Pay-as-you-go Reconciliation in Schema Matching Networks

8

"Crowdsourcing is the practice of obtaining needed services, ideas, or content by soliciting contributions from a large group of people, and especially from an online community, rather than from traditional employees or suppliers." (Wikipedia)

Our context: employ many workers (users) to validate the same correspondences and combine their answers.

Surveys: [1], [2]. A wide range of applications (e.g. CrowdSearch) has been developed on top of more than 70 crowdsourcing platforms (e.g. Amazon Mechanical Turk).

Our contributions:
- Define network-level constraints in a schema matching network
- Design questions for workers to validate correspondences
- Leverage network-level constraints to reduce user effort

[1] E. Law et al. "Human Computation". Morgan & Claypool Publishers, 2011.
[2] A. Doan et al. "Crowdsourcing Systems on the World-Wide Web". CACM, 2011.


11

Three elements of a question:
- Asking object: a correspondence
- Possible choices: a simple YES/NO question
- Support information: alternatives, constraint satisfactions, constraint violations

12

User feedbacks:

  User  Question  Answer
  U1    C         Yes
  U2    C         Yes
  U3    C         No

User reliabilities:

  User  Reliability
  U1    r1
  U2    r2
  U3    r3

r1 = Pr(C=true | U1=yes) = Pr(C=false | U1=no)

Answer aggregation: a probabilistic model (*) combines the user feedbacks with the user reliabilities to compute Pr(C), i.e. for each correspondence the pair <a, e> of aggregation and error rate:

  Corr  Aggregation  Error Rate
  C     True         0.19

(*) Majority Voting, Expectation Maximization, … See the full paper for details.
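As a minimal sketch of this aggregation step (assuming independent workers, a uniform prior, and r_i read as the probability that worker U_i answers correctly; the function name and the reliability value 0.81 are illustrative choices, not the paper's model):

```python
# Bayesian aggregation of independent worker answers for one
# correspondence C, with uniform prior Pr(C=true) = 0.5 and
# r_i = Pr(worker U_i answers correctly). Illustrative sketch only;
# the paper also considers Majority Voting and Expectation Maximization.
def aggregate(answers, reliabilities):
    """answers: list of 'yes'/'no' votes; returns Pr(C=true | answers)."""
    p_true, p_false = 0.5, 0.5
    for a, r in zip(answers, reliabilities):
        if a == "yes":
            p_true *= r
            p_false *= 1 - r
        else:
            p_true *= 1 - r
            p_false *= r
    return p_true / (p_true + p_false)

# Two "yes" votes and one "no" with equal reliabilities r = 0.81:
p = aggregate(["yes", "yes", "no"], [0.81, 0.81, 0.81])
print(round(1 - p, 2))  # error rate 0.19
```

With equal reliabilities and a two-against-one vote, the posterior equals r itself, so r = 0.81 reproduces the 0.19 error rate shown for C above.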

13

Solution: Leverage constraints to reduce error rate

[Figure: example with worker reliability r = 0.6 and a goal error rate.]

To achieve higher accuracy, we need more answers → cost-accuracy tradeoff.
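The tradeoff can be made concrete with a small calculation (a sketch assuming independent workers, a uniform prior, and unanimous answers; the unanimity setting is an illustrative simplification):

```python
# Cost-accuracy tradeoff sketch: with worker reliability r = 0.6,
# how many unanimous answers are needed before the aggregated error
# rate drops below a 0.1 threshold? (Independent workers, uniform
# prior; unanimity is an illustrative simplification.)
def error_after(n, r):
    """Error rate after n unanimous 'yes' answers of reliability r."""
    return (1 - r) ** n / (r ** n + (1 - r) ** n)

r, threshold = 0.6, 0.1
n = 1
while error_after(n, r) >= threshold:
    n += 1
print(n, round(error_after(n, r), 3))  # 6 0.081
```

One answer leaves an error rate of 0.4; each additional agreeing answer shrinks it, but slowly, which is why leveraging constraints is attractive compared to simply buying more answers.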

14

Idea: correspondences support each other if they satisfy a constraint

1-1 constraint: ONE source attribute matches only ONE target attribute.

[Figure: schemas S and T; attribute a of S has candidate correspondences ab1 (to attribute b1) and ab2 (to attribute b2) in T.]

Pr(ab1=true) = 0.8
Pr(ab2=false) = 0.6

Joint distribution (by independence, e.g. Pr(ab1=T, ab2=F) = 0.8 × 0.6 = 0.48):

  ab1  ab2  Prob  1-1 constraint
  T    T    0.32  not satisfied
  T    F    0.48  satisfied
  F    T    0.08  satisfied
  F    F    0.12  satisfied

Pr(ab2=F | constraint satisfied) = (0.48 + 0.12) / (0.48 + 0.08 + 0.12) ≈ 0.88

Without constraint:

  Corr  Aggregation  Error Rate
  ab2   False        0.4 (*)

With constraint:

  Corr  Aggregation  Error Rate
  ab2   False        0.12 (**)

(*) Error rate = 1 − Pr(ab2=false) = 0.4
(**) Error rate = 1 − Pr(ab2=false | constraint satisfied) ≈ 0.12 < 0.4
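These numbers can be checked by enumerating the joint distribution and conditioning on the 1-1 constraint (a verification sketch; variable names are mine):

```python
from itertools import product

# Verify the 1-1 constraint example: Pr(ab1=T) = 0.8, Pr(ab2=T) = 0.4.
# Conditioning on "not both true" lowers the error rate of deciding
# ab2 = False from 0.4 to about 0.12.
p_ab1, p_ab2 = 0.8, 0.4
num = den = 0.0
for ab1, ab2 in product([True, False], repeat=2):
    prob = (p_ab1 if ab1 else 1 - p_ab1) * (p_ab2 if ab2 else 1 - p_ab2)
    if ab1 and ab2:   # violates the 1-1 constraint: excluded
        continue
    den += prob       # probability mass satisfying the constraint
    if not ab2:
        num += prob   # ... where ab2 is False
error_rate = 1 - num / den
print(round(error_rate, 2))  # 0.12
```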

15

Circle constraint: a sequence of correspondences creates a closed circle. Δ: probability of compensating errors along the circle (*).

[Figure: attributes a, b, c in schemas S1, S2, S3, with correspondences ab, bc, ac forming a circle.]

Pr(ab=T) = 0.8, Pr(bc=T) = 0.8, Pr(ac=T) = 0.8

Joint distribution (by independence, e.g. Pr(ab=T, bc=T, ac=T) = 0.8 × 0.8 × 0.8 = 0.512), with the likelihood that each assignment is consistent with the circle constraint:

  ab  bc  ac  Prob   Consistency
  T   T   T   0.512  1.0
  T   T   F   0.128  0.0
  T   F   T   0.128  0.0
  T   F   F   0.032  Δ
  F   T   T   0.128  0.0
  F   T   F   0.032  Δ
  F   F   T   0.032  Δ
  F   F   F   0.008  Δ

Pr(ab=T | circle consistent) = (0.512 + 0.032·Δ) / (0.512 + 3·0.032·Δ + 0.008·Δ) ≈ 0.973 with Δ = 0.2

Without constraint:

  Corr  Aggregation  Error Rate
  ab    True         0.2 (**)

With constraint:

  Corr  Aggregation  Error Rate
  ab    True         0.027 (***)

(**) Error rate = 1 − Pr(ab=T) = 0.2
(***) Error rate = 1 − Pr(ab=T | circle consistent) ≈ 0.027 < 0.2

(*) Cudré-Mauroux, P. et al. "Probabilistic Message Passing in Peer Data Management Systems". ICDE 2006.
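The circle-constraint numbers can be verified the same way, weighting each assignment by its consistency likelihood (1.0 for no false edges, 0.0 for exactly one, Δ for compensating errors; Δ = 0.2 here, chosen so the result matches the 0.027 on the slide):

```python
from itertools import product

# Verify the circle-constraint example: Pr = 0.8 for each of ab, bc, ac.
# An assignment is consistent with likelihood 1.0 (no false edges),
# 0.0 (exactly one false edge), or Delta (compensating errors).
p, delta = 0.8, 0.2
num = den = 0.0
for ab, bc, ac in product([True, False], repeat=3):
    prob = 1.0
    for edge in (ab, bc, ac):
        prob *= p if edge else 1 - p
    n_false = sum(1 for edge in (ab, bc, ac) if not edge)
    weight = 1.0 if n_false == 0 else (0.0 if n_false == 1 else delta)
    den += prob * weight       # mass of "looks consistent" assignments
    if ab:
        num += prob * weight   # ... where ab is True
print(round(1 - num / den, 3))  # 0.027
```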

16

Settings:
- Real-world schemas; use ground truth to simulate users/workers.
- Error threshold = 0.1: make a decision when the error rate < 0.1; otherwise, continue to ask users.
- Metric: Cost = …

Observation: Cost (with constraints) < Cost (without constraints).

17

We model a crowdsourcing process for schema matching networks.

We address two optimization goals: minimize monetary cost and maximize accuracy (minimize error rate).

We design a variety of questions with different support information. We leverage consistency constraints to reduce the error rate → reduce the monetary cost.

18
