Transcript
Page 1: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

EarthBiAs2014  

Global  NEST    

University  of  the  Aegean  

Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Insight Centre for Data Analytics,

National University of Ireland Galway

Page 2: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Take  Home  

Algorithms Humans Better Data Data

Page 3: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Talk  Overview  

•  Part I: Motivation
•  Part II: Data Quality and Data Curation
•  Part III: Crowdsourcing
•  Part IV: Case Studies on Crowdsourced Data Curation
•  Part V: Setting up a Crowdsourced Data Curation Process
•  Part VI: Linked Open Data Example
•  Part VII: Future Research Challenges

7-11 July 2014, Rhodes, Greece  EarthBiAs2014

Page 4: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

MOTIVATION  

PART  I  

Page 5: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

BIG Big Data Public Private Forum

THE BIG PROJECT

Overall objective

Bringing the necessary stakeholders into a self-sustainable, industry-led initiative that will greatly contribute to enhancing EU competitiveness by taking full advantage of Big Data technologies.

Work at the technical, business and policy levels, shaping the future through the positioning of IIM and Big Data, specifically in Horizon 2020.

Page 6: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

SITUATING BIG DATA IN INDUSTRY

Sectors: Health; Public Sector; Finance & Insurance; Telco, Media & Entertainment; Manufacturing, Retail, Energy, Transport

Needs Offerings

Value Chain

Technical Working Groups

Industry Driven Sectorial Forums

Data Acquisition
•  Structured data
•  Unstructured data
•  Event processing
•  Sensor networks
•  Protocols
•  Real-time
•  Data streams
•  Multimodality

Data Analysis
•  Stream mining
•  Semantic analysis
•  Machine learning
•  Information extraction
•  Linked Data
•  Data discovery
•  ‘Whole world’ semantics
•  Ecosystems
•  Community data analysis
•  Cross-sectorial data analysis

Data Curation
•  Data Quality
•  Trust / Provenance
•  Annotation
•  Data validation
•  Human-Data Interaction
•  Top-down/Bottom-up
•  Community / Crowd
•  Human Computation
•  Curation at scale
•  Incentivisation
•  Automation
•  Interoperability

Data Storage
•  In-Memory DBs
•  NoSQL DBs
•  NewSQL DBs
•  Cloud storage
•  Query Interfaces
•  Scalability and Performance
•  Data Models
•  Consistency, Availability, Partition-tolerance
•  Security and Privacy
•  Standardization

Data Usage
•  Decision support
•  Prediction
•  In-use analytics
•  Simulation
•  Exploration
•  Visualisation
•  Modeling
•  Control
•  Domain-specific usage

Page 7: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

SUBJECT MATTER EXPERT INTERVIEWS

Page 8: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

KEY INSIGHTS

Key Trends
▶  Lower usability barrier for data tools
▶  Blended human and algorithmic data processing for data quality
▶  Leveraging large communities (crowds)
▶  Need for standardized semantic data representation
▶  Significant increase in use of new data models, e.g. graph (expressivity and flexibility)

The Data Landscape
▶  Much of (Big Data) technology is evolving incrementally
▶  But business process change must be revolutionary
▶  Data variety and verifiability are key opportunities
▶  The long tail of data variety is a major shift in the data landscape

Biggest Blockers
▶  Lack of business-driven Big Data strategies
▶  Need for format and data storage technology standards
▶  Data exchange between companies, institutions, individuals, etc.
▶  Regulations & markets for data access
▶  Human resources: lack of skilled data scientists

Technical White Papers available on: http://www.big-project.eu

Page 9: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

The Internet of Everything: Connecting the Unconnected

Page 10: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Earth Science – Systems of Systems

Page 11: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Page 12: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Citizen Sensors

“…humans as citizens on the ubiquitous Web, acting as sensors and sharing their observations and views…”

¨  Sheth, A. (2009). Citizen sensing, social signals, and enriching human experience. IEEE Internet Computing, 13(4), 87-92.

Air Pollution

Page 13: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Page 14: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Page 15: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Citizens as Sensors

Page 16: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Haklay, M., 2013. Citizen Science and Volunteered Geographic Information – overview and typology of participation. In Sui, D.Z., Elwood, S. and Goodchild, M.F. (eds.), Crowdsourcing Geographic Knowledge: Volunteered Geographic Information (VGI) in Theory and Practice. Berlin: Springer.

Page 17: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

DATA  QUALITY  AND  DATA  CURATION  

PART  II  

Page 18: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

The Problems with Data

Knowledge Workers need:
¨  Access to the right data
¨  Confidence in that data

Flawed data affects 25% of critical data in the world’s top companies.

Data quality played a role in the recent financial crisis:
¨  “Assets are defined differently in different programs”
¨  “Numbers did not always add up”
¨  “Departments do not trust each other’s figures”
¨  “Figures … not worth the pixels they were made of”

Page 19: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

What is Data Quality?

“Desirable characteristics of an information resource”

Described as a series of quality dimensions:
n  Discoverability & Accessibility: stored and classified in an appropriate and consistent manner
n  Accuracy: correctly represents the “real-world” values it models
n  Consistency: created and maintained using standardized definitions, calculations, terms, and identifiers
n  Provenance & Reputation: track the source & determine its reputation
¨  Includes the objectivity of the source/producer
¨  Is the information unbiased, unprejudiced, and impartial?
¨  Or does it come from a reputable but partisan source?

Wang, R. and D. Strong, Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, 1996. 12(4): p. 5-33.
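These dimensions lend themselves to simple automated checks. A minimal sketch (not from the slides), scoring a hypothetical record set with invented field names for completeness and consistency:

```python
# Illustrative sketch: scoring records against two quality dimensions.
# The records and field names are invented for the example.

def completeness(records, required):
    """Fraction of records that populate every required field."""
    ok = sum(1 for r in records if all(r.get(f) not in (None, "") for f in required))
    return ok / len(records)

def consistency(records, field, allowed):
    """Fraction of records whose field uses a standardized value."""
    ok = sum(1 for r in records if r.get(field) in allowed)
    return ok / len(records)

records = [
    {"id": "S1", "unit": "ppm", "value": 42.0},
    {"id": "S2", "unit": "PPM", "value": 17.5},   # non-standard unit spelling
    {"id": "S3", "unit": "ppm", "value": None},   # missing measurement
]

print(completeness(records, ["id", "unit", "value"]))  # 2/3
print(consistency(records, "unit", {"ppm"}))           # 2/3
```

Real data-quality tooling adds many more dimensions (accuracy against a reference, provenance tracking), but each reduces to a measurable predicate in the same way.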

Page 20: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Data Quality

Source A (relational):

ID    PNAME      PCOLOR   PRICE
APNR  iPod Nano  Red      150
APNS  iPod Nano  Silver   160

Source B (XML):

<Product name="iPod Nano">
    <Items>
        <Item code="IPN890">
            <price>150</price>
            <generation>5</generation>
        </Item>
    </Items>
</Product>

Questions spanning the Data Developer (technical domain), the Data Steward, and Business Users (business domain): Schema Difference? Value Conflicts? Entity Duplication?
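The merge problems above can be made concrete in a small sketch. The two sources mirror the slide's example; the matching rule (same name and price) is an invented heuristic, not a prescribed method:

```python
# Hypothetical sketch of the conflicts a data developer faces when merging
# Source A (relational) with Source B (nested). Product data is invented.

source_a = [
    {"id": "APNR", "name": "iPod Nano", "color": "Red", "price": 150},
    {"id": "APNS", "name": "iPod Nano", "color": "Silver", "price": 160},
]
source_b = {"name": "iPod Nano",
            "items": [{"code": "IPN890", "price": 150, "generation": 5}]}

# Schema difference: Source B carries no colour, Source A no generation.
a_fields = set(source_a[0])
b_fields = {"name"} | set(source_b["items"][0])
print(a_fields - b_fields)   # fields present only in A

# Entity duplication: the same product appears in both sources under
# different identifiers -- a human curator must confirm the match.
candidates = [(r["id"], i["code"])
              for r in source_a for i in source_b["items"]
              if r["name"] == source_b["name"] and r["price"] == i["price"]]
print(candidates)
```

Automated heuristics like this surface *candidate* conflicts; deciding whether APNR and IPN890 are truly the same entity is exactly the judgment the data steward supplies.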

Page 21: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

What is Data Curation?

n  Digital Curation
¨  Selection, preservation, maintenance, collection, and archiving of digital assets

n  Data Curation
¨  Active management of data over its life-cycle

n  Data Curators
¨  Ensure data is trustworthy, discoverable, accessible, reusable, and fit for use – the museum cataloguers of the Internet age

Page 22: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Related Activities

n  Data Governance / Master Data Management
¨  Convergence of data quality, data management, business process management, and risk management
¨  Part of the overall data governance strategy for an organization

n  Data Curator = Data Steward

Page 23: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Types of Data Curation

n  Multiple approaches to curate data, no single correct way
¨  Who?
–  Individual Curators
–  Curation Departments
–  Community-based Curation
¨  How?
–  Manual Curation
–  (Semi-)Automated Curation
–  Sheer Curation

Page 24: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Types of Data Curation – Who?

n  Individual Data Curators
¨  Suitable for an infrequently changing, small quantity of data
–  (<1,000 records)
–  Minimal curation effort (minutes per record)

Page 25: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Types of Data Curation – Who?

n  Curation Departments
¨  Curation experts working with subject matter experts to curate data within a formal process
–  Can deal with a large curation effort (000’s of records)

n  Limitations
¨  Scalability: can struggle with large quantities of dynamic data (>million records)
¨  Availability: post-hoc nature creates a delay in curated data availability

Page 26: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Types of Data Curation - Who?

n  Community-Based Data Curation
¨  Decentralized approach to data curation
¨  Crowdsourcing the curation process
–  Leverages a community of users to curate data

¨ Wisdom of the community (crowd)

¨ Can scale to millions of records

Page 27: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Types of Data Curation – How?

n  Manual Curation
¨  Curators directly manipulate data
¨  Can tie users up with low-value-add activities

n  (Semi-)Automated Curation
¨  Algorithms can (semi-)automate curation activities such as data cleansing, record de-duplication and classification
¨  Can be supervised or approved by human curators
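One way to sketch this supervised loop, with an illustrative string-similarity measure and invented thresholds: confident matches are merged automatically, borderline ones are queued for a human curator.

```python
# Sketch of supervised semi-automation: an algorithm scores candidate
# duplicate pairs, auto-merges the confident ones, and routes borderline
# cases to a human review queue. Thresholds and data are illustrative.
from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def triage(pairs, auto=0.9, review=0.6):
    auto_merge, human_queue = [], []
    for a, b in pairs:
        s = similarity(a, b)
        if s >= auto:
            auto_merge.append((a, b))       # confident: merge automatically
        elif s >= review:
            human_queue.append((a, b))      # borderline: curator decides
    return auto_merge, human_queue

pairs = [("iPod Nano", "ipod nano"),
         ("iPod Nano", "iPod Mini"),
         ("iPod Nano", "Galaxy S")]
merged, queued = triage(pairs)
print(merged, queued)
```

The key design choice is the two thresholds: they trade algorithmic throughput against the amount of human attention the curation process can afford.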

Page 28: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Types of Data Curation – How?

n  Sheer curation, or curation at source
¨  Curation activities integrated into the normal workflow of those creating and managing data
¨  Can be as simple as vetting or “rating” the results of a curation algorithm

¨ Results can be available immediately

n  Blended Approaches: Best of Both ¨ Sheer curation + post hoc curation department

¨ Allows immediate access to curated data

¨ Ensures quality control with expert curation

Page 29: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Data Quality

Data Curation Example
¨  Profile Sources
¨  Define Mappings
¨  Cleanse & Enrich
¨  De-duplicate & Define Rules
¨  Curated Data

Data Developer

Data Curator

Data Governance

Business Users

Applications

Product Data

Page 30: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Data Curation

n  Pros
¨  Can create a single version of truth
¨  Standardized information creation and management
¨  Improves data quality

n  Cons
¨  Significant upfront costs and effort
¨  Participation limited to a few (mostly technical) experts
¨  Difficult to scale for large data sources
–  Extended enterprise, e.g. partners, data vendors
¨  Small % of data under management (e.g. CRM, Product, …)

Page 31: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

The New York Times

100 Years of Expert Data Curation

Page 32: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

The New York Times

n  Largest metropolitan and third-largest newspaper in the United States

n  nytimes.com
q  Most popular newspaper website in the US

n  100-year-old curated repository defining its participation in the emerging Web of Data

Page 33: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

The New York Times

n  Data curation dates back to 1913
¨  Publisher/owner Adolph S. Ochs decided to provide a set of additions to the newspaper

n  New York Times Index
¨  Organized catalog of article titles and summaries
–  Containing the issue, date and column of each article
–  Categorized by subject and names
–  Introduced on a quarterly, then annual, basis

n  The transitory content of the newspaper became an important source of searchable historical data
¨  Often used to settle historical debates

Page 34: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

The New York Times

n  Index Department was created in 1913
¨  Curation and cataloguing of NYT resources
–  Since 1851 the NYT had had only a low-quality index for internal use

n  Developed a comprehensive catalog using a controlled vocabulary
¨  Covering subjects, personal names, organizations, geographic locations and titles of creative works (books, movies, etc.), linked to articles and their summaries

n  The current Index Dept. has ~15 people

Page 35: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

The New York Times

n  Challenges with consistently and accurately classifying news articles over time
¨  Keywords expressing subjects may show some variance due to cultural or legal constraints
¨  Identities of some entities, such as organizations and places, changed over time

n  The controlled vocabulary grew to hundreds of thousands of categories
¨  Adding complexity to the classification process

Page 36: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

The New York Times

n  Increased importance of Web drove need to improve categorization of online content

n  Curation carried out by the Index Department
¨  Library-time (days to weeks)
¨  The print edition can handle a next-day index

n  Not suitable for real-time online publishing
¨  nytimes.com needed a same-day index

Page 37: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

The New York Times

n  Introduced a two-stage curation process
¨  Editorial staff perform best-effort, semi-automated sheer curation at the point of online publication
–  Several hundred journalists
¨  The Index Department follows up with long-term accurate classification and archiving

n  Benefits:
¨  Non-expert journalist curators provide instant accessibility to online users
¨  The Index Department provides long-term high-quality curation in a “trust but verify” approach

Page 38: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

NYT Curation Workflow

¨ Curation starts with article getting out of the newsroom

Page 39: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

NYT Curation Workflow

¨ A member of the editorial staff submits the article to a web-based, rule-based information extraction system (SAS Teragram)

Page 40: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

NYT Curation Workflow

¨ Teragram uses linguistic extraction rules based on subset of Index Dept’s controlled vocab.

Page 41: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

NYT Curation Workflow

¨ Teragram suggests tags based on the Index vocabulary that can potentially describe the content of article
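A toy sketch of this style of vocabulary-driven tag suggestion; the vocabulary, trigger phrases and article text below are all invented, and real Teragram linguistic rules are far richer than substring matching:

```python
# Illustrative rule-based tag suggester over a tiny controlled vocabulary.
# Vocabulary terms, trigger phrases and the article are made up.

VOCABULARY = {
    "Air Pollution": ["air quality", "smog", "emissions"],
    "Climate Change": ["global warming", "greenhouse gas"],
    "Greece": ["athens", "rhodes"],
}

def suggest_tags(article_text):
    """Return vocabulary terms whose trigger phrases occur in the article."""
    text = article_text.lower()
    return sorted(tag for tag, triggers in VOCABULARY.items()
                  if any(t in text for t in triggers))

article = "Smog levels over Athens prompted new emissions rules this week."
print(suggest_tags(article))  # ['Air Pollution', 'Greece']
```

The crucial property is that suggestions are drawn only from the controlled vocabulary, so whatever the editor accepts stays consistent with the Index taxonomy.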

Page 42: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

NYT Curation Workflow

¨ Editorial staff member selects terms that best describe the contents and inserts new tags if necessary

Page 43: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

NYT Curation Workflow

¨ Reviewed by the taxonomy managers with feedback to editorial staff on classification process

Page 44: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

NYT Curation Workflow

¨ Article is published online at nytimes.com

Page 45: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

NYT Curation Workflow

¨ At a later stage the article receives second-level curation by the Index Dept.: additional Index tags and a summary

Page 46: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

NYT Curation Workflow

¨ Article is submitted to NYT Index

Page 47: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

The New York Times

n  Early adopter of Linked Open Data (June ‘09)

Page 48: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

The New York Times

n  Linked Open Data @ data.nytimes.com
¨  Subset of 10,000 tags from the index vocabulary
¨  Dataset of people, organizations & locations
–  Complemented by search services to consume data about articles, movies, best sellers, Congress votes, real estate, …

n  Benefits
¨  Improves traffic through third-party data usage
¨  Lowers the development cost of new applications for different verticals inside the website
–  E.g. movies, travel, sports, books

Page 49: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

CROWDSOURCING  

PART  III  

Page 50: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Crowdsourcing Landscape

Page 51: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Crowdsourcing Landscape

Page 52: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Introduction to Crowdsourcing

n  Coordinating a crowd (a large group of workers) to do micro-work (small tasks) that solves problems (that computers or a single user can’t)

n  A collection of mechanisms and associated methodologies for scaling and directing crowd activities to achieve goals

n  Related Areas ¨  Collective Intelligence

¨  Social Computing

¨  Human Computation

¨  Data Mining

A. J. Quinn and B. B. Bederson, “Human computation: a survey and taxonomy of a growing field,” in Proceedings of the 2011 Annual Conference on Human Factors in Computing Systems, 2011, pp. 1403–1412.

Page 53: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

When Computers Were Human

n  Maskelyne, 1760
¨  Used human computers to create an almanac of moon positions
–  Used for shipping/navigation
¨  Quality assurance
–  Do calculations twice
–  Compare to a third verifier

D. A. Grier, When Computers Were Human, vol. 13. Princeton University Press, 2005.
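Maskelyne's quality-assurance scheme translates directly into modern redundancy-based answer aggregation: assign each calculation to two workers and, on disagreement, escalate to a third verifier. A sketch with made-up values:

```python
# Redundancy-based quality assurance in the spirit of Maskelyne's scheme:
# two independent answers, with a third verifier breaking disagreements.
from collections import Counter

def verified_answer(worker_a, worker_b, verifier):
    """Accept agreeing answers; otherwise take the majority of three."""
    if worker_a == worker_b:
        return worker_a
    votes = Counter([worker_a, worker_b, verifier])
    answer, count = votes.most_common(1)[0]
    return answer if count >= 2 else None   # still unresolved

print(verified_answer(42, 42, 41))  # agreement -> 42
print(verified_answer(42, 40, 42))  # verifier breaks the tie -> 42
print(verified_answer(42, 40, 39))  # three-way disagreement -> None
```

The same pattern (redundant assignment plus majority vote) underpins quality control on today's crowdsourcing marketplaces.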

Page 54: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

When Computers Were Human

Page 55: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

20th Century

1926: Teleautomation “…when wireless is perfectly applied the whole earth will be converted into a huge brain.”

1948: Cybernetics “…communication and control theory that is concerned especially with the comparative study of automatic control systems.”

1961: Embedded systems “A system with a dedicated function within a larger mechanical or electrical system, often with real-time computing constraints.”

1988: Ubiquitous computing “…advanced computing concept where computing is made to appear everywhere and anywhere.”

Credits: Thierry Ehrmann (Flickr), Dr. Sabina Jeschke, Wikimedia Foundation

Page 56: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

21st Century

Credits: Kevin Ashton, Amith Sheth, Helen Gill, Wikimedia Foundation

1999: Internet of Things “…to uniquely identifiable objects and their virtual representations in an Internet-like structure.”

2006: Cyber-physical systems “…integrations of computation and physical processes.”

2008: Web of Things “A set of blueprints to make every-day physical objects first class citizens of the World Wide Web by giving them an API.”

2012: Physical-Cyber-Social computing “…a holistic treatment of data, information, and knowledge from the PCS worlds to integrate, correlate, interpret, and provide contextually relevant abstractions to humans.”

Page 57: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Sensing – Computation – Actuation

Human-Powered CPS: leverages human capabilities in conjunction with machine capabilities for optimizing processes in cyber-physical-social environments

Credits: Albany Associates, stuartpilrow, Mike_n (Flickr)

Page 58: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Human
ü Visual perception
ü Visuospatial thinking
ü Audiolinguistic ability
ü Sociocultural awareness
ü Creativity
ü Domain knowledge

Machine
ü Large-scale data manipulation
ü Collecting and storing large amounts of data
ü Efficient data movement
ü Bias-free analysis

Human vs Machine Affordances

R. J. Crouser and R. Chang, “An affordance-based framework for human computation and human-computer collaboration,” IEEE Trans. Vis. Comput. Graph., vol. 18, pp. 2859–2868, 2012.

Page 59: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

When to Crowdsource a Task?

n Computers cannot do the task

n Single person cannot do the task

n Work can be split into smaller tasks

Page 60: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Platforms and Marketplaces

Page 61: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Types of Crowds

n  Internal corporate communities
¨  Taps the potential of the internal workforce
¨  Curate competitive enterprise data that will remain internal to the company
–  May not always be the case, e.g. product technical support and marketing data

n  External communities
¨  Public crowd-sourcing marketplaces
¨  Pre-competitive communities

Page 62: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Generic Architecture

Requestors ↔ Platform/Marketplace (Publish Task, Task Management) ↔ Workers (steps 1–4)
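One plausible reading of the four-step loop, sketched as a toy in-memory platform; all class and method names are invented for illustration:

```python
# Minimal sketch of a crowdsourcing platform loop: (1) a requestor publishes
# tasks, (2) workers pull them, (3) workers submit answers, (4) the requestor
# collects results. Names are invented; real marketplaces add payment,
# qualification and review machinery.
from collections import deque

class Platform:
    def __init__(self):
        self.open_tasks = deque()
        self.results = {}

    def publish(self, task_id, payload):          # step 1: requestor
        self.open_tasks.append((task_id, payload))

    def next_task(self):                          # step 2: worker pulls
        return self.open_tasks.popleft() if self.open_tasks else None

    def submit(self, task_id, answer):            # step 3: worker answers
        self.results.setdefault(task_id, []).append(answer)

    def collect(self, task_id):                   # step 4: requestor reads
        return self.results.get(task_id, [])

p = Platform()
p.publish("t1", "Is this tweet about air pollution?")
task = p.next_task()
p.submit(task[0], "yes")
print(p.collect("t1"))  # ['yes']
```

The platform's job is purely mediation: it decouples requestors from workers so that each side only sees tasks and results, never each other.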

Page 63: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Mturk Workflow

Page 64: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Spatial Crowdsourcing

n  Crowdsourcing that requires a person to travel to a location to perform a spatial task
¨  Helps non-local requesters through workers in a targeted spatial locality
¨  Used for data collection, package routing, citizen actuation
¨  Usually based on mobile applications
¨  Closely related to social sensing, participatory sensing, etc.
¨  Early example: the Aardvark social search engine

n  Example systems
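The core matching step in spatial crowdsourcing can be sketched as picking the worker nearest to the task location; the coordinates and worker IDs below are invented:

```python
# Sketch of nearest-worker assignment for a spatial task.
# Worker positions and the task location are made up.
from math import radians, sin, cos, asin, sqrt

def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in km."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(h))

workers = {"w1": (36.43, 28.22),   # near Rhodes
           "w2": (37.98, 23.73),   # near Athens
           "w3": (53.27, -9.05)}   # near Galway

task_location = (36.40, 28.15)     # a task on Rhodes
nearest = min(workers, key=lambda w: haversine_km(workers[w], task_location))
print(nearest)  # w1
```

Production systems refine this with availability windows, travel budgets and task expiry, but distance-based assignment is the common core.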

Page 65: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

CASE STUDIES ON CROWDSOURCED DATA CURATION

PART IV

Page 66: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Crowdsourced Data Curation

Inputs: Web of Data, Databases, Textual Content

DQ Rules & Algorithms: Entity Linking, Data Fusion, Relation Extraction

Human Computation: Relevance Judgment, Data Verification, Disambiguation

Internal Community (Programmers, Managers): Domain Knowledge, High-Quality Responses, Trustable

External Crowd: High Availability, Large Scale, Expertise Variety

Output: Clean Data

Page 67: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Examples of CDM Tasks

n  Understanding customer sentiment for the launch of a new product around the world

n  Implemented a 24/7 sentiment analysis system with workers from around the world

n  90% accuracy on 95% of content

n  Categorize millions of products in eBay’s catalog with accurate and complete attributes

n  Combine the crowd with machine learning to create an affordable and flexible catalog quality system

Page 68: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Examples of CDM Tasks

n  Natural Language Processing
¨  Dialect identification, spelling correction, machine translation, word similarity

n  Computer Vision
¨  Image similarity, image annotation/analysis

n  Classification
¨  Data attributes, improving taxonomies, search results

n  Verification
¨  Entity consolidation, de-duplication, cross-checking, data validation

n  Enrichment
¨  Judgments, annotation

Page 69: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Wikipedia

n  Collaboratively built by a large community
¨  More than 19,000,000 articles, 270+ languages, 3,200,000+ articles in English
¨  More than 157,000 active contributors

n  Accuracy and stylistic formality are equivalent to expert-based resources
¨  i.e. the Columbia and Britannica encyclopedias

n  MediaWiki
¨  The software behind Wikipedia
¨  Widely used inside organizations
¨  Intellipedia: 16 U.S. intelligence agencies
¨  WikiProteins: curated protein data for knowledge discovery

Page 70: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Wikipedia

n  Decentralized environment supports the creation of high-quality information with:
¨  Social organization
¨  Artifacts, tools & processes for cooperative work coordination

n  Wikipedia collaboration dynamics highlight good practices

Page 71: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Wikipedia – Social Organization

n  Any user can edit its contents
¨  Without prior registration

n  Does not lead to a chaotic scenario
¨  In practice a highly scalable approach to high-quality content creation on the Web

n  Relies on a simple but highly effective way to coordinate its curation process

n  Curation is an activity of Wikipedia admins
¨  Responsible for information quality standards

Page 72: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Wikipedia – Social Organization

Page 73: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Wikipedia – Social Organization

n  Incentives
¨  Improvement of one’s reputation
¨  Sense of efficacy
–  Contributing effectively to a meaningful project
¨  Over time the focus of editors typically changes
–  From curating a few articles on specific topics
–  To a more global curation perspective
–  Enforcing quality assessment of Wikipedia as a whole

Page 74: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Wikipedia – Artifacts, Tools & Processes

n  Wiki Article Editor (Tool)
¨  WYSIWYG or markup text editor

n  Talk Pages (Tool)
¨  Public arena for discussions around Wikipedia resources

n  Watchlists (Tool)
¨  Help curators actively monitor the integrity and quality of resources they contribute

n  Permission Mechanisms (Tool)
¨  Users with administrator status can perform critical actions such as removing pages and granting administrative permissions to new users

n  Automated Edition (Tool)
¨  Bots are automated or semi-automated tools that perform repetitive tasks over content

n  Page History and Restore (Tool)
¨  Historical trail of changes to a Wikipedia resource

n  Guidelines, Policies & Templates (Artifact)
¨  Define curation guidelines for editors to assess article quality

n  Dispute Resolution (Process)
¨  Dispute mechanism between editors over article contents

n  Article Edition, Deletion, Merging, Redirection, Transwiking, Archival (Process)
¨  Describe the curation actions over Wikipedia resources

Page 75: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

DBpedia Knowledge Base

n  DBpedia provides direct access to data
¨  Indirectly uses the wiki as a data curation platform
¨  Inherits the massive volume of curated Wikipedia data
¨  3.4 million entities and 1 billion RDF triples
¨  Comprehensive data infrastructure
–  Concept URIs
–  Definitions
–  Basic types
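A toy illustration of the triple structure such a knowledge base exposes; the triples below are hand-abbreviated in DBpedia's prefix style rather than fetched from the live endpoint:

```python
# Toy subject-predicate-object store mimicking the shape of DBpedia data.
# The triples are invented abbreviations, not real query results.
triples = [
    ("dbr:Rhodes", "rdf:type", "dbo:Place"),
    ("dbr:Rhodes", "dbo:country", "dbr:Greece"),
    ("dbr:Galway", "rdf:type", "dbo:Place"),
]

def objects(subject, predicate):
    """All objects matching a given subject/predicate pair."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects("dbr:Rhodes", "dbo:country"))  # ['dbr:Greece']
```

Real clients would issue SPARQL queries against DBpedia's public endpoint instead of scanning an in-memory list, but the underlying data model is the same triple pattern match.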

Page 76: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Page 77: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Wikipedia – DBpedia

Page 78: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

n  Collaborative knowledge base maintained by community of web users

n  Users create entity types and their meta-data according to guidelines

n  Requires administrative approval for schema changes by end users

Page 79: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Page 80: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Page 81: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Page 82: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Audio Tagging - Tag a Tune

Page 83: Crowdsourcing Approaches to Big Data Curation for Earth Sciences


Image Tagging - Peekaboom

Page 84: Crowdsourcing Approaches to Big Data Curation for Earth Sciences


Protein Folding - Fold.it

Page 85: Crowdsourcing Approaches to Big Data Curation for Earth Sciences


ReCaptcha

n  OCR
¨  ~1% error rate
¨  20%-30% for 18th and 19th century books
n  "40 million reCAPTCHAs every day" (2008)
¨  Fixing 40,000 books a day

Page 86: Crowdsourcing Approaches to Big Data Curation for Earth Sciences


ChemSpider

n  Structure-centric chemical community
¨  Over 300 data sources with 25 million records
¨  Provided by chemical vendors, government databases, private laboratories and individuals
n  Pharma realizing benefits of open data
¨  Heavily leveraged by pharmaceutical companies as pre-competitive resources for experimental and clinical trial investigation
¨  GlaxoSmithKline made its proprietary malaria dataset of 13,500 compounds available

Page 87: Crowdsourcing Approaches to Big Data Curation for Earth Sciences


n  Dedicated to improving understanding of the functions of biological systems through the 3-D structure of macromolecules
¨  Started in 1971 with 3 core members
¨  Originally offered 7 crystal structures
¨  Grown to 63,000 structures
¨  Over 300 million dataset downloads
n  Expanded beyond a curated data download service to include complex molecular visualization, search, and analysis capabilities

Page 88: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

SETTING UP A CROWDSOURCED DATA CURATION PROCESS

PART V


Page 89: Crowdsourcing Approaches to Big Data Curation for Earth Sciences


Core Design Questions

What: Goal
Why: Incentives
Who: Workers
How: Process

Malone, T. W., Laubacher, R., & Dellarocas, C. N. Harnessing crowds: Mapping the genome of collective intelligence. MIT Sloan Research Paper 4732-09, (2009).

Page 90: Crowdsourcing Approaches to Big Data Curation for Earth Sciences


Setting up a Curation Process

1 – Who is doing it?

2 – Why are they doing it?

3 – What is being done?

4 – How is it being done?

Page 91: Crowdsourcing Approaches to Big Data Curation for Earth Sciences


1) Who is doing it? (Workers)

n  Hierarchy (Assignment)
¨  Someone in authority assigns a particular person or group of people to perform the task
¨  Within the enterprise (e.g. individuals, specialised departments)
¨  Within a structured community (e.g. a pre-competitive community)
n  Crowd (Choice)
¨  Anyone in a large group who chooses to do so
¨  Internal or external crowds

Page 92: Crowdsourcing Approaches to Big Data Curation for Earth Sciences


2) Why are they doing it? (Incentives)

n  Motivation
¨  Money ($$££)
¨  Glory (reputation/prestige)
¨  Love (altruism, socialising, enjoyment)
¨  Unintended by-product (e.g. reCAPTCHA, captured in workflow)
¨  Self-serving resources (e.g. Wikipedia, product/customer data)
¨  Part of their job description (e.g. data curation as part of a role)
n  Determine pay and time for each task
¨  Marketplace: delicate balance
–  Money does not improve quality but can increase participation
¨  Internal hierarchy: engineering opportunities for recognition
–  Performance reviews, prizes for top contributors, badges, leaderboards, etc.

Page 93: Crowdsourcing Approaches to Big Data Curation for Earth Sciences


Effect of Payment on Quality

n  Cost does not affect quality
n  Similar results for bigger tasks [Ariely et al., 2009]

Mason, W. A., & Watts, D. J. (2009). Financial incentives and the "performance of crowds." Proceedings of the Human Computation Workshop. Paris: ACM, June 28, 2009.

[Panos Ipeirotis, WWW2011 tutorial]

Page 94: Crowdsourcing Approaches to Big Data Curation for Earth Sciences


3) What is being done? (Goal)

3.1 Identify the Data
¨  Newly created data and/or legacy data?
¨  How is new data created?
–  Do users create the data, or is it imported from an external source?
¨  How frequently is new data created/updated?
¨  What quantity of data is created?
¨  How much legacy data exists?
¨  Is it stored within a single source, or scattered across multiple sources?

Page 95: Crowdsourcing Approaches to Big Data Curation for Earth Sciences


3) What is being done? (Goal)

3.2 Identify the Tasks
¨  Creation Tasks
–  Create/Generate
–  Find
–  Improve/Edit/Fix
¨  Decision (Vote) Tasks
–  Accept/Reject
–  Thumbs Up/Thumbs Down
–  Vote for Best

Page 96: Crowdsourcing Approaches to Big Data Curation for Earth Sciences


4) How is it being done? (How)

n  Identify the workflow
¨  Tasks integrated into the normal workflow of those creating and managing data
¨  Can be as simple as vetting or "rating" the results of an algorithm
n  Identify the platform
¨  Internal/community collaboration platforms
¨  Public crowdsourcing platform
–  Consider the availability of appropriate workers (i.e. experts)
n  Identify the algorithm
¨  Data quality
¨  Image recognition
¨  etc.

Page 97: Crowdsourcing Approaches to Big Data Curation for Earth Sciences


4) How is it being done? (How)

n  Task Design
¨  Task Interface
¨  Task Assignment/Routing
¨  Task Quality Assurance

Page 98: Crowdsourcing Approaches to Big Data Curation for Earth Sciences


Task Design


* Edith Law and Luis von Ahn, Human Computation - Core Research Questions and State of the Art

n  Task Router (before computation)
n  Task Interface (during computation)
n  Output Aggregation (after computation)
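These three stages can be sketched as a minimal human-computation pipeline. This is purely illustrative: `run_tasks` and the stub components are hypothetical names, not the API of any real platform.

```python
def run_tasks(tasks, workers, route, ask, aggregate, replication=3):
    """Minimal human-computation pipeline: a task router picks workers before
    computation, a task interface collects answers during computation, and
    output aggregation combines the redundant answers afterwards."""
    results = {}
    for task in tasks:
        chosen = route(task, workers, replication)          # Task Router
        answers = [ask(worker, task) for worker in chosen]  # Task Interface
        results[task] = aggregate(answers)                  # Output Aggregation
    return results

# Toy run with stub components standing in for real platform pieces
workers = ["w1", "w2", "w3"]
route = lambda task, ws, k: ws[:k]          # naive router: first k workers
ask = lambda worker, task: "spam"           # every worker answers "spam"
aggregate = lambda answers: max(set(answers), key=answers.count)  # majority
results = run_tasks(["t1"], workers, route, ask, aggregate)
```

Swapping in a real router, interface, or aggregator changes only one callable, which is why the three stages are usually designed independently.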

Page 99: Crowdsourcing Approaches to Big Data Curation for Earth Sciences


Pull Routing

n  Workers seek tasks and assign them to themselves
¨  Search and discovery of tasks supported by the platform
¨  Task recommendation
¨  Peer routing

[Diagram: workers search and browse available tasks, select them, and return results to the algorithm]

Page 100: Crowdsourcing Approaches to Big Data Curation for Earth Sciences


Push Routing

n  System assigns tasks to workers based on:
¨  Past performance
¨  Expertise
¨  Cost
¨  Latency

[Diagram: the algorithm assigns tasks to workers through a task interface and collects their results (* www.mobileworks.com)]
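A push router of this kind can be sketched as a weighted scoring function over the four criteria above. The field names and weights below are assumptions for illustration, not the actual algorithm of any platform.

```python
def route_task(task, workers, weights=(0.5, 0.2, 0.2, 0.1)):
    """Push-routing sketch: rank workers for a task by a weighted score of
    past performance, expertise match, cost, and latency (all scaled to [0, 1])."""
    w_perf, w_exp, w_cost, w_lat = weights

    def score(worker):
        expertise = 1.0 if task["skill"] in worker["skills"] else 0.0
        return (w_perf * worker["accuracy"]
                + w_exp * expertise
                + w_cost * (1.0 - worker["cost"])     # cheaper is better
                + w_lat * (1.0 - worker["latency"]))  # faster is better

    return max(workers, key=score)

workers = [
    {"id": "w1", "accuracy": 0.9, "skills": {"geo"}, "cost": 0.8, "latency": 0.2},
    {"id": "w2", "accuracy": 0.7, "skills": {"geo", "ocr"}, "cost": 0.3, "latency": 0.5},
]
best = route_task({"skill": "geo"}, workers)  # w1: accuracy outweighs its higher cost
```

Tuning the weight vector is how a system trades answer quality against cost and latency.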

Page 101: Crowdsourcing Approaches to Big Data Curation for Earth Sciences


Managing Task Quality Assurance

n  Redundancy: Quorum Votes
¨  Replicate the task (e.g. 3 times)
¨  Use majority voting to determine the right value (% agreement)
¨  Weighted majority vote
n  Gold Data / Honey Pots
¨  Inject trap questions to test quality
¨  Worker fatigue check (habit of saying no all the time)
n  Estimation of Worker Quality
¨  Redundancy plus gold data
n  Qualification Test
¨  Use test tasks to determine a user's ability for such tasks
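The quorum-vote and gold-data checks above can be sketched as follows. This is a minimal illustration; the function names are hypothetical and a production system would add weighting and repeated estimation.

```python
from collections import Counter

def majority_vote(answers, min_agreement=0.5):
    """Aggregate redundant answers for one task: return (value, agreement),
    or (None, agreement) when no answer reaches the required quorum."""
    counts = Counter(answers)
    value, hits = counts.most_common(1)[0]
    agreement = hits / len(answers)
    return (value, agreement) if agreement >= min_agreement else (None, agreement)

def worker_accuracy(worker_answers, gold):
    """Estimate a worker's quality as accuracy on injected gold questions."""
    scored = [(task, ans) for task, ans in worker_answers.items() if task in gold]
    if not scored:
        return None  # no gold questions seen yet
    return sum(1 for task, ans in scored if ans == gold[task]) / len(scored)

# Three redundant judgements for one task
value, agreement = majority_vote(["spam", "spam", "ham"])   # value is "spam"

# Screen one worker against two injected gold questions
gold = {"t1": "spam", "t2": "ham"}
quality = worker_accuracy({"t1": "spam", "t2": "ham", "t3": "spam"}, gold)
```

The worker-quality estimate can then feed back into a weighted majority vote, giving more say to workers who pass the honey pots.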

Page 102: Crowdsourcing Approaches to Big Data Curation for Earth Sciences


Social Best Practices

n  Participation
¨  Stakeholder involvement for data producers and consumers must occur early in the project
–  Provides insight into basic questions of what they want to do, for whom, and what it will provide
n  Incentives
¨  Sheer curation needs a line of sight from the data curating activity to tangible exploitation benefits
¨  Recognizing contributing curators through a formal feedback mechanism
n  Engagement
¨  Outreach essential for promotion and feedback
¨  Typical consumers-to-contributors ratios < 5%

Page 103: Crowdsourcing Approaches to Big Data Curation for Earth Sciences


Technical Best Practices

n  Human & Automated Curation
¨  Automated curation should always defer to, and never override, human curation edits
–  Automate validation of data deposition and entry
–  Target the community at focused curation tasks
n  Track Provenance
¨  Curation activities should be recorded
–  Especially where human curators are involved
¨  Different perspectives on provenance
–  A scientist may need to evaluate the fine-grained experiment description behind the data
–  For a business analyst, the 'brand' of the data provider can be sufficient for determining quality

Page 104: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

LINKED  OPEN  DATA  EXAMPLE  

PART  VI  


Page 105: Crowdsourcing Approaches to Big Data Curation for Earth Sciences


Linked Open Data (LOD)

n  Expose and interlink datasets on the Web
n  Use URIs to identify "things" in your data
n  Use a graph representation (RDF) to describe URIs
n  Vision: the Web as a huge graph database

Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

Page 106: Crowdsourcing Approaches to Big Data Curation for Earth Sciences


Linked Data Example

Multiple Identifiers

Identity resolution links

Page 107: Crowdsourcing Approaches to Big Data Curation for Earth Sciences


Identity Resolution in LOD

n  Quality issues with Linked Data
¨  Imprecise, outdated, or wrong
n  Uncertainty of identity resolution links
¨  Due to multiple identity equivalence interpretations
¨  Due to characteristics of link generation algorithms (similarity-based)
n  User feedback for uncertain links
¨  Verify uncertain identity resolution links with users/experts
¨  Improve quality of entity consolidation

Page 108: Crowdsourcing Approaches to Big Data Curation for Earth Sciences


Identity Resolution in LOD

[Diagram: multiple identifiers for the 'Galway' entity in the Linked Open Data cloud (<http://www.freebase.com/view/en/galway>, <http://dbpedia.org/resource/Galway>, <http://sws.geonames.org/2964180/>), connected by owl:sameAs links from different sources of identity resolution links (publishers and consumers)]

Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
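Consolidating such identifiers reduces to grouping URIs connected by owl:sameAs links, for example with a union-find. This is a toy sketch of the idea, not the tooling of any of the publishers involved.

```python
def consolidate(same_as_links):
    """Union-find over owl:sameAs links: group URIs that name the same entity."""
    parent = {}

    def find(u):
        parent.setdefault(u, u)
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving keeps trees shallow
            u = parent[u]
        return u

    for a, b in same_as_links:
        parent[find(a)] = find(b)  # union the two clusters

    clusters = {}
    for u in parent:
        clusters.setdefault(find(u), set()).add(u)
    return list(clusters.values())

links = [
    ("http://dbpedia.org/resource/Galway", "http://sws.geonames.org/2964180/"),
    ("http://www.freebase.com/view/en/galway", "http://dbpedia.org/resource/Galway"),
]
groups = consolidate(links)  # one cluster containing all three Galway URIs
```

Because sameAs links can be wrong, the uncertain ones are exactly what the user-feedback step on the previous slide is meant to verify before consolidation.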

Page 109: Crowdsourcing Approaches to Big Data Curation for Earth Sciences


LOD Application Architecture

[Diagram: LOD application architecture with Utility, Feedback, and Consolidation modules, exchanging questions, feedback, rules, matching dependencies, ranked feedback tasks, candidate links, and data improvements]

Tom Heath and Christian Bizer (2011) Linked Data: Evolving the Web into a Global Data Space (1st edition), 1-136. Morgan & Claypool.
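One plausible way a utility module could rank feedback tasks is to ask humans about the links whose confidence is closest to 0.5, where an answer carries the most information. The function name, budget parameter, and prefixes such as dbp:/geo: below are illustrative assumptions, not part of a published API.

```python
def rank_for_feedback(candidate_links, budget=2):
    """Order candidate identity links so the most uncertain ones (confidence
    nearest 0.5) are routed to human curators first, within a question budget."""
    return sorted(candidate_links, key=lambda link: abs(link["conf"] - 0.5))[:budget]

candidates = [
    {"pair": ("dbp:Galway", "geo:2964180"), "conf": 0.95},     # accept automatically
    {"pair": ("dbp:Galway", "dbp:Galway_GAA"), "conf": 0.55},  # worth asking a human
    {"pair": ("dbp:Galway", "dbp:Paris"), "conf": 0.05},       # reject automatically
]
questions = rank_for_feedback(candidates, budget=1)  # the 0.55 link comes first
```

High- and low-confidence links are accepted or rejected automatically, so the limited human budget is spent only where it changes the outcome.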

Page 110: Crowdsourcing Approaches to Big Data Curation for Earth Sciences

FUTURE  RESEARCH  CHALLENGES  

PART VII


Page 111: Crowdsourcing Approaches to Big Data Curation for Earth Sciences


Future Research Directions

n  Incentives and social engagement
¨  Better recognition of the data curation role
¨  Understanding of social engagement mechanisms
n  Economic Models
¨  Pre-competitive and public-private partnerships
n  Curation at Scale
¨  Evolution of human computation and crowdsourcing
¨  Instrumenting popular apps for data curation
¨  General-purpose data curation pipelines
¨  Human-data interaction
n  Trust
¨  Capture of data curation decisions & provenance management
¨  Fine-grained permission management models and tools
n  Data Curation Models
¨  Nanopublications
¨  Theoretical principles and domain-specific models

Page 112: Crowdsourcing Approaches to Big Data Curation for Earth Sciences


Future Research Directions

n  Spatial Crowdsourcing
¨  Matching tasks with workers at the right time and location
¨  Balancing workload among workers
¨  Tasks at remote locations
¨  Chaining tasks in the same vicinity
¨  Preserving worker privacy
n  Interoperability
¨  Finding semantic similarity of tasks across systems
¨  Defining and measuring worker capability across heterogeneous systems
¨  Enabling routing middleware for multiple systems
¨  Compatibility of reputation systems
¨  Defining standards for task exchange

Page 113: Crowdsourcing Approaches to Big Data Curation for Earth Sciences


Heterogeneous Crowds

n  Multiple requesters, tasks, workers, and platforms

[Diagram: collaborative data curation as a cyber-physical social system spanning tasks, workers, and platforms]

Page 114: Crowdsourcing Approaches to Big Data Curation for Earth Sciences


SLUA Ontology

[Diagram: in the SLUA ontology, a Task offers a Reward, includes an Action, and requires a Capability; a User earns Rewards, performs Actions, and possesses Capabilities. Subclasses of Capability: Location, Skill, Knowledge, Ability, Availability. Subclasses of Reward: Reputation, Money, Fun, Altruism, Learning]

U. ul Hassan, S. O’Riain, E. Curry, “SLUA: Towards Semantic Linking of Users with Actions in Crowdsourcing,” in International Workshop on Crowdsourcing the Semantic Web, 2013.

Page 115: Crowdsourcing Approaches to Big Data Curation for Earth Sciences


Future Research Directions

n  Task Routing
¨  Optimizing task completion, quality, and latency
¨  Inferring worker preferences, skills, and knowledge
¨  Balancing the exploration-exploitation trade-off between inference and optimization
¨  Cold-start problem for new workers or tasks
¨  Ensuring worker satisfaction via load balancing & rewards
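The exploration-exploitation and cold-start points can be illustrated with an epsilon-greedy router, a textbook bandit strategy sketched here under assumed data structures rather than any specific system's design.

```python
import random

def pick_worker(workers, stats, epsilon=0.1, rng=random):
    """Epsilon-greedy routing sketch: usually exploit the worker with the best
    observed accuracy, explore a random worker with probability epsilon, and
    always try never-observed workers first (easing the cold-start problem)."""
    unseen = [w for w in workers if stats.get(w, (0, 0))[1] == 0]
    if unseen:
        return rng.choice(unseen)
    if rng.random() < epsilon:
        return rng.choice(workers)  # explore
    return max(workers, key=lambda w: stats[w][0] / stats[w][1])  # exploit

def record(stats, worker, correct):
    """Update a worker's (correct, total) counts after verifying an answer."""
    ok, total = stats.get(worker, (0, 0))
    stats[worker] = (ok + int(correct), total + 1)

stats = {}
record(stats, "w1", True)
record(stats, "w1", True)
record(stats, "w2", False)
chosen = pick_worker(["w1", "w2"], stats, epsilon=0.0)  # exploits "w1"
```

Raising epsilon spends more tasks on learning worker quality; lowering it spends more on the workers already believed to be best.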

n  Human-Computer Interaction
¨  Reducing search friction through good browsing interfaces
¨  Presenting requisite information, nothing more
¨  Choosing the level of task granularity for complex tasks
¨  Ensuring worker engagement
¨  Designing games with a purpose to crowdsource with fun

Page 116: Crowdsourcing Approaches to Big Data Curation for Earth Sciences


Summary

Algorithms Humans Better Data Data

Page 117: Crowdsourcing Approaches to Big Data Curation for Earth Sciences


Selected References

n  Big Data & Data Quality ¨  S. Lavalle, E. Lesser, R. Shockley, M. S. Hopkins, and N. Kruschwitz, “Big Data, Analytics and the

Path from Insights to Value,” MIT Sloan Management Review, vol. 52, no. 2, pp. 21–32, 2011.

¨  A. Haug and J. S. Arlbjørn, “Barriers to master data quality,” Journal of Enterprise Information Management, vol. 24, no. 3, pp. 288–303, 2011.

¨  R. Silvola, O. Jaaskelainen, H. Kropsu-Vehkapera, and H. Haapasalo, “Managing one master data – challenges and preconditions,” Industrial Management & Data Systems, vol. 111, no. 1, pp. 146–162, 2011.

¨  E. Curry, S. Hasan, and S. O’Riain, “Enterprise Energy Management using a Linked Dataspace for Energy Intelligence,” in Second IFIP Conference on Sustainable Internet and ICT for Sustainability, 2012.

¨  D. Loshin, Master Data Management. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2008.

¨  Pipino, L. L., Lee, Y. W., & Wang, R. Y. (2002). Data quality assessment. Communications of the ACM, 45(4), 211-2

¨  Batini, C., Cappiello, C., Francalanci, C., & Maurino, A. (2009). Methodologies for data quality assessment and improvement. ACM Computing Surveys (CSUR), 41(3), 16.

¨  B. Otto and A. Reichert, “Organizing Master Data Management: Findings from an Expert Survey,” in Proceedings of the 2010 ACM Symposium on Applied Computing - SAC ’10, 2010, pp. 106–110.

¨  Wang, R. and D. Strong, Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, 1996. 12(4): p. 5-33

¨  Ul Hassan, U., O’Riain, S., and Curry, E. 2012. “Leveraging Matching Dependencies for Guided User Feedback in Linked Data Applications,” In 9th International Workshop on Information Integration on the Web (IIWeb2012) Scottsdale, Arizona,: ACM.

Page 118: Crowdsourcing Approaches to Big Data Curation for Earth Sciences


Selected References

n  Collective Intelligence, Crowdsourcing & Human Computation ¨  Malone, Thomas W., Robert Laubacher, and Chrysanthos Dellarocas. "Harnessing Crowds: Mapping the

Genome of Collective Intelligence." (2009). ¨  A. Doan, R. Ramakrishnan, and A. Y. Halevy, “Crowdsourcing systems on the World-Wide Web,”

Communications of the ACM, vol. 54, no. 4, p. 86, Apr. 2011. ¨  A. J. Quinn and B. B. Bederson, “Human computation: a survey and taxonomy of a growing field,” in

Proceedings of the 2011 Annual Conference on Human Factors in Computing Systems, 2011, pp. 1403–1412.

¨  Mason, W. A., & Watts, D. J. (2009). Financial incentives and the ‘‘performance of crowds.’’ Proceedings of the Human Computation Workshop. Paris: ACM, June 28, 2009.

¨  E. Law and L. von Ahn, “Human Computation,” Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 5, no. 3, pp. 1–121, Jun. 2011.

¨  M. J. Franklin, D. Kossmann, T. Kraska, S. Ramesh, and R. Xin, “CrowdDB : Answering Queries with Crowdsourcing,” in Proceedings of the 2011 international conference on Management of data - SIGMOD ’11, 2011, p. 61.

¨  P. Wichmann, A. Borek, R. Kern, P. Woodall, A. K. Parlikad, and G. Satzger, “Exploring the ‘Crowd’ as Enabler of Better Information Quality,” in Proceedings of the 16th International Conference on Information Quality, 2011, pp. 302–312.

¨  Winter A. Mason, Duncan J. Watts: Financial incentives and the "performance of crowds". SIGKDD Explorations (SIGKDD) 11(2):100-108 (2009)

¨  Panos Ipeirotis. Managing Crowdsourced Human Computation, WWW2011 Tutorial ¨  O. Alonso & M. Lease. Crowdsourcing 101: Putting the WSDM of Crowds to Work for You, WSDM Hong Kong

2011. ¨  D. A. Grier, When Computers Were Human, vol. 13. Princeton University Press, 2005.

–  http://www.youtube.com/watch?v=YwqltwvPnkw

¨  Ul Hassan, U., & Curry, E. (2013, October). A capability requirements approach for predicting worker performance in crowdsourcing. In Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom), 2013 9th International Conference on (pp. 429-437). IEEE.

Page 119: Crowdsourcing Approaches to Big Data Curation for Earth Sciences


Selected References

n  Collaborative Data Management ¨  E. Curry, A. Freitas, and S. O. Riain, “The Role of Community-Driven Data Curation for Enterprises,” in

Linking Enterprise Data, D. Wood, Ed. Boston, MA: Springer US, 2010, pp. 25–47. ¨  Ul Hassan, U., O’Riain, S., and Curry, E. 2012. “Towards Expertise Modelling for Routing Data Cleaning

Tasks within a Community of Knowledge Workers,” In 17th International Conference on Information Quality (ICIQ 2012), Paris, France.

¨  Ul Hassan, U., O’Riain, S., and Curry, E. 2013. “Effects of Expertise Assessment on the Quality of Task Routing in Human Computation,” In 2nd International Workshop on Social Media for Crowdsourcing and Human Computation, Paris, France.

¨  Stonebraker, M., Bruckner, D., Ilyas, I. F., Beskales, G., Cherniack, M., Zdonik, S. B., ... & Xu, S. (2013). Data Curation at Scale: The Data Tamer System. In CIDR.

¨  Parameswaran, A. G., Park, H., Garcia-Molina, H., Polyzotis, N., & Widom, J. (2012, October). Deco: declarative crowdsourcing. In Proceedings of the 21st ACM international conference on Information and knowledge management (pp. 1203-1212). ACM.

¨  Parameswaran, A., Boyd, S., Garcia-Molina, H., Gupta, A., Polyzotis, N., & Widom, J. (2014). Optimal crowd-powered rating and filtering algorithms.Proceedings Very Large Data Bases (VLDB).

¨  Marcus, A., Wu, E., Karger, D., Madden, S., & Miller, R. (2011). Human-powered sorts and joins. Proceedings of the VLDB Endowment, 5(1), 13-24.

¨  Guo, S., Parameswaran, A., & Garcia-Molina, H. (2012, May). So who won?: dynamic max discovery with the crowd. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (pp. 385-396). ACM.

¨  Davidson, S. B., Khanna, S., Milo, T., & Roy, S. (2013, March). Using the crowd for top-k and group-by queries. In Proceedings of the 16th International Conference on Database Theory (pp. 225-236). ACM.

¨  Chai, X., Vuong, B. Q., Doan, A., & Naughton, J. F. (2009, June). Efficiently incorporating user feedback into information extraction and integration programs. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of data (pp. 87-100). ACM.

Page 120: Crowdsourcing Approaches to Big Data Curation for Earth Sciences


Selected References

n  Spatial Crowdsourcing ¨  Kazemi, L., & Shahabi, C. (2012, November). Geocrowd: enabling query answering with spatial

crowdsourcing. In Proceedings of the 20th International Conference on Advances in Geographic Information Systems (pp. 189-198). ACM.

¨  Benouaret, K., Valliyur-Ramalingam, R., & Charoy, F. (2013). CrowdSC: Building Smart Cities with Large Scale Citizen Participation. IEEE Internet Computing, 1.

¨  Musthag, M., & Ganesan, D. (2013, April). Labor dynamics in a mobile micro-task market. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 641-650). ACM.

¨  Deng, Dingxiong, Cyrus Shahabi, and Ugur Demiryurek. "Maximizing the number of worker's self-selected tasks in spatial crowdsourcing." Proceedings of the 21st ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. ACM, 2013.

¨  To, H., Ghinita, G., & Shahabi, C. (2014). A Framework for Protecting Worker Location Privacy in Spatial Crowdsourcing. Proceedings of the VLDB Endowment, 7(10).

¨  Goncalves, J., Ferreira, D., Hosio, S., Liu, Y., Rogstadius, J., Kukka, H., & Kostakos, V. (2013, September). Crowdsourcing on the spot: altruistic use of public displays, feasibility, performance, and behaviours. In Proceedings of the 2013 ACM international joint conference on Pervasive and ubiquitous computing(pp. 753-762). ACM.

¨  Cardone, G., Foschini, L., Bellavista, P., Corradi, A., Borcea, C., Talasila, M., & Curtmola, R. (2013). Fostering participaction in smart cities: a geo-social crowdsensing platform. Communications Magazine, IEEE, 51(6).

Page 121: Crowdsourcing Approaches to Big Data Curation for Earth Sciences


Books

n  Surowiecki, J. (2005). The wisdom of crowds. Random House LLC.

n  Batini, C., & Scannapieco, M. (2006). Data quality: concepts, methodologies and techniques. Springer.

n  Michelucci, P. (2013). Handbook of human computation. Springer.

n  Law, E., & Ahn, L. V. (2011). Human computation. Synthesis Lectures on Artificial Intelligence and Machine Learning, 5(3), 1-121.

n  Heath, T., & Bizer, C. (2011). Linked data: Evolving the web into a global data space. Synthesis lectures on the semantic web: theory and technology, 1(1), 1-136.

n  Grier, D. A. (2013). When computers were human. Princeton University Press.

n  Easley, D., & Kleinberg, J. Networks, Crowds, and Markets. Cambridge University.

n  Sheth, A., & Thirunarayan, K. (2012). Semantics Empowered Web 3.0: Managing Enterprise, Social, Sensor, and Cloud-based Data and Services for Advanced Applications. Synthesis Lectures on Data Management, 4(6), 1-175.

Page 122: Crowdsourcing Approaches to Big Data Curation for Earth Sciences


Tutorials

n  Human Computation and Crowdsourcing ¨  http://research.microsoft.com/apps/video/default.aspx?id=169834

¨  http://www.youtube.com/watch?v=tx082gDwGcM

n  Human-Powered Data Management ¨  http://research.microsoft.com/apps/video/default.aspx?id=185336

n  Crowdsourcing Applications and Platforms: A Data Management Perspective ¨  http://www.vldb.org/pvldb/vol4/p1508-doan-tutorial4.pdf

n  Human Computation: Core Research Questions and State of the Art ¨  http://www.humancomputation.com/Tutorial.html

n  Crowdsourcing & Machine Learning ¨  http://www.cs.rutgers.edu/~hirsh/icml-2011-tutorial/

n  Data quality and data cleaning: an overview ¨  http://dl.acm.org/citation.cfm?id=872875

Page 123: Crowdsourcing Approaches to Big Data Curation for Earth Sciences


Datasets

n  TREC Crowdsourcing Track ¨  https://sites.google.com/site/treccrowd/

n  2010 Crowdsourced Web Relevance Judgments Data
¨  https://docs.google.com/document/d/1J9H7UIqTGzTO3mArkOYaTaQPibqOTYb_LwpCpu2qFCU/edit

n  Statistical QUality Assurance Robustness Evaluation Data ¨  http://ir.ischool.utexas.edu/square/data.html

n  Crowdsourcing at Scale 2013 ¨  http://www.crowdscale.org/

n  USEWOD - Usage Analysis and the Web of Data ¨  http://usewod.org/usewodorg-2.html

n  NAACL 2010 Workshop ¨  https://sites.google.com/site/amtworkshop2010/data-1

n  mturk-tracker.com

n  GalaxyZoo.com

n  CrowdCrafting.com

Page 124: Crowdsourcing Approaches to Big Data Curation for Earth Sciences


Credits

Special thanks to Umair ul Hassan for his assistance with the Tutorial
