Crowdsourcing Approaches to Big Data Curation for Earth Sciences

  • Published on
    11-Aug-2014

Transcript

EarthBiAs2014, 7-11 July 2014, Rhodes, Greece
Global NEST, University of the Aegean

Crowdsourcing Approaches to Big Data Curation for Earth Sciences
Insight Centre for Data Analytics, National University of Ireland Galway

Take Home
  • Diagram labels: Algorithms, Humans, Data, Better Data

Talk Overview
  • Part I: Motivation
  • Part II: Data Quality and Data Curation
  • Part III: Crowdsourcing
  • Part IV: Case Studies on Crowdsourced Data Curation
  • Part V: Setting up a Crowdsourced Data Curation Process
  • Part VI: Linked Open Data Example
  • Part VII: Future Research Challenges

PART I: MOTIVATION

BIG: Big Data Public Private Forum
THE BIG PROJECT
  • Overall objective: bringing the necessary stakeholders into a self-sustainable industry-led initiative, which will greatly contribute to enhance the EU competitiveness taking full advantage of Big Data technologies.
  • Work at technical, business and policy levels, shaping the future through the positioning of IIM and Big Data specifically in Horizon 2020.

SITUATING BIG DATA IN INDUSTRY
  • Industry Driven Sectorial Forums (needs, offerings):
      • Health
      • Public Sector
      • Finance & Insurance
      • Telco, Media & Entertainment
      • Manufacturing, Retail, Energy, Transport
  • Technical Working Groups (value chain):
      • Data Acquisition: structured data, unstructured data, event processing, sensor networks, protocols, real-time, data streams, multimodality
      • Data Analysis: stream mining, semantic analysis, machine learning, information extraction, Linked Data, data discovery, "whole world" semantics, ecosystems, community data analysis, cross-sectorial data analysis
      • Data Curation: data quality, trust / provenance, annotation, data validation, human-data interaction, top-down / bottom-up, community / crowd, human computation, curation at scale, incentivisation, automation, interoperability
      • Data Storage: in-memory DBs, NoSQL DBs, NewSQL DBs, cloud storage, query interfaces, scalability and performance, data models, consistency / availability / partition-tolerance, security and privacy, standardization
      • Data Usage: decision support, prediction, in-use analytics, simulation, exploration, visualisation, modeling, control, domain-specific usage

SUBJECT MATTER EXPERT INTERVIEWS

KEY INSIGHTS
  • Key Trends
      • Lower usability barrier for data tools
      • Blended human and algorithmic data processing for coping with data quality
      • Leveraging large communities (crowds)
      • Need for semantic, standardized data representation
      • Significant increase in use of new data models (i.e. graph) for expressivity and flexibility
  • The Data Landscape
      • Much of (Big Data) technology is evolving in an evolutionary way
      • But business process change must be revolutionary
      • Data variety and verifiability are key opportunities
      • Long tail of data variety is a major shift in the data landscape
  • Biggest Blockers
      • Lack of business-driven Big Data strategies
      • Need for format and data storage technology standards
      • Data exchange between companies, institutions, individuals, etc.
      • Regulations & markets for data access
      • Human resources: lack of skilled data scientists
  • Technical White Papers available on: http://www.big-project.eu

The Internet of Everything: Connecting the Unconnected

Earth Science Systems of Systems

Citizen Sensors
  • "humans as citizens on the ubiquitous Web, acting as sensors and sharing their observations and views"
  • Sheth, A. (2009). Citizen sensing, social signals, and enriching human experience. Internet Computing, IEEE, 13(4), 87-92.

Air Pollution

Citizens as Sensors
  • Haklay, M. (2013). Citizen Science and Volunteered Geographic Information: overview and typology of participation. In Sui, D.Z., Elwood, S. and Goodchild, M.F. (eds.), Crowdsourcing Geographic Knowledge: Volunteered Geographic Information (VGI) in Theory and Practice. Berlin: Springer.

PART II: DATA QUALITY AND DATA CURATION

The Problems with Data
  • Knowledge workers need:
      • Access to the right data
      • Confidence in that data
  • Flawed data affects 25% of critical data in the world's top companies
  • Data quality played a role in the recent financial crisis:
      • Assets are defined differently in different programs
      • Numbers did not always add up
      • Departments do not trust each other's figures
      • Figures not worth the pixels they were made of

What is Data Quality?
  • Desirable characteristics for an information resource
  • Described as a series of quality dimensions:
      • Discoverability & Accessibility: storing and classifying in an appropriate and consistent manner
      • Accuracy: correctly represents the real-world values it models
      • Consistency: created and maintained using standardized definitions, calculations, terms, and identifiers
      • Provenance & Reputation: track source and determine reputation. Includes the objectivity of the source/producer: is the information unbiased, unprejudiced, and impartial, or does it come from a reputable but partisan source?
  • Wang, R. and D. Strong, Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, 1996. 12(4): p. 5-33.

Data Quality (example figure)
  • Sources: Source A, Source B
  • Table shown in the figure:
      ID    PNAME      PCOLOR  PRICE
      APNR  iPod Nano  Red     150
      APNS  iPod Nano  Silver  160
  • Further rows shown in the figure (markup of the second source not recoverable):
      APNR  iPod Nano  Red     150
      APNR  iPod Nano  Silver  160
      iPod Nano  IPN890  150  5
  • Questions raised: Schema difference? Value conflicts? Entity duplication?
  • Roles: Data Developer (technical), Data Steward (technical and domain), Business Users (domain)

What is Data Curation?
  • Digital Curation: selection, preservation, maintenance, collection, and archiving of digital assets
  • Data Curation: active management of data over its life-cycle
  • Data Curators: ensure data is trustworthy, discoverable, accessible, reusable, and fit for use
      • Museum cataloguers of the Internet age

Related Activities
  • Data Governance / Master Data Management
      • Convergence of data quality, data management, business process management, and risk management
      • Part of the overall data governance strategy for an organization
  • Data Curator = Data Steward

Types of Data Curation
  • Multiple approaches to curate data, no single correct way
  • Who?
      • Individual Curators
      • Curation Departments
      • Community-based Curation
  • How?
      • Manual Curation
      • (Semi-)Automated
      • Sheer Curation

Types of Data Curation: Who?
  • Individual Data Curators
      • Suitable for infrequently changing small quantity of data (million records)
      • Availability: post-hoc nature creates delay in curated data availability
  • Community-Based Data Curation
      • Decentralized approach to data curation
      • Crowd-sourcing the curation process
      • Leverages community of users to curate data
      • Wisdom of the community (crowd)
      • Can scale to millions of records

Types of Data Curation: How?
  • Manual Curation
      • Curators directly manipulate data
      • Can tie users up with low-value-add activities
  • (Semi-)Automated Curation
      • Algorithms can (semi-)automate curation activities such as data cleansing, record duplication and classification
      • Can be supervised or approved by human curators
  • Sheer curation, or Curation at Source
      • Curation activities integrated in the normal workflow of those creating and managing data
      • Can be as simple as vetting or rating the results of a curation algorithm
      • Results can be available immediately
  • Blended Approaches: Best of Both
      • Sheer curation + post hoc curation department
      • Allows immediate access to curated data
      • Ensures quality control with expert curation

Data Quality: Data Curation Example (figure)
  • Steps: Profile Sources, Define Mappings, Cleanse, Enrich, De-duplicate, Define Rules
  • Roles: Data Developer, Data Curator, Data Governance, Business Users
  • Flow: Product Data, Curated Data, Applications

Data Curation
  • Pros
      • Can create a single version of truth
      • Standardized information creation and management
      • Improves data quality
  • Cons
      • Significant upfront costs and efforts
      • Participation limited to few (mostly) technical experts
      • Difficult to scale for large data sources
          • Extended enterprise, e.g. partners, data vendors
      • Small % of data under management (i.e. CRM, Product, ...)

The New York Times: 100 Years of Expert Data Curation

The New York Times
  • Largest metropolitan and third largest newspaper in the United States
  • nytimes.com
      • Most popular newspaper website in the US
  • 100-year-old curated repository defining its participation in the emerging Web of Data

The New York Times
  • Data curation dates back to 1913
      • Publisher/owner Adolph S. Ochs decided to provide a set of additions to the newspaper
  • New York Times Index
      • Organized catalog of article titles and summaries
      • Containing issue, date and column of article
      • Categorized by subject and names
      • Introduced on a quarterly, then annual basis
  • Transitory content of the newspaper became an important source of searchable historical data
      • Often used to settle historical debates

The New York Times
  • Index Department was created in 1913
      • Curation and cataloguing of NYT resources
      • Since 1851 NYT had a low-quality index for internal use
  • Developed a comprehensive catalog using a controlled vocabulary
      • Covering subjects, personal names, organizations, geographic locations and titles of creative works (books, movies, etc.), linked to articles and their summaries
  • Current Index Dept. has ~15 people

The New York Times
  • Challenges with consistently and accurately classifying news articles over time
      • Keywords expressing subjects may show some variance due to cultural or legal constraints
      • Identities of some entities, such as organizations and places, changed over time
  • Controlled vocabulary grew to hundreds of thousands of categories
      • Adding complexity to the classification process

The New York Times
  • Increased importance of the Web drove the need to improve categorization of online content
  • Curation carried out by Index Department
      • Library-time (days to weeks)
      • Print edition can handle next-day index
  • Not suitable for real-time online publishing
      • nytimes.com needed a same-day index

The New York Times
  • Introduced two-stage curation process
  • Editorial staff performed best-effort semi-automated sheer curation at point of online publication
      • Several hundred journalists
  • Index Department follows up with long-term accurate classification and archiving
  • Benefits:
      • Non-expert journalist curators provide instant accessibility to online users
      • Index Department provides long-term high-quality curation in a "trust but verify" approach

NYT Curation Workflow
  • Curation starts with the article getting out of the newsroom
  • Member of editorial staff submits article to a web-based, rule-based information extraction system (SAS Teragram)
  • Teragram uses linguistic extraction rules based on a subset of the Index Dept's controlled vocabulary
  • Teragram suggests tags ba...
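
As a rough illustration of the workflow described above (rule-based tag suggestion against a controlled vocabulary, followed by human vetting at the point of publication), here is a minimal Python sketch. It is not the Teragram system and does not reflect the NYT's actual rules. The vocabulary entries, the `Article` class, and the functions `suggest_tags` and `review_tags` are all hypothetical names invented for this example, and simple phrase matching stands in for real linguistic extraction rules.

```python
import re
from dataclasses import dataclass, field

# Hypothetical controlled vocabulary: canonical tag -> trigger phrases.
# A real vocabulary would hold far more categories than this toy one.
CONTROLLED_VOCAB = {
    "Air Pollution": ["air pollution", "smog", "air quality"],
    "Earthquakes": ["earthquake", "seismic"],
    "Greece": ["greece", "rhodes"],
}


@dataclass
class Article:
    headline: str
    body: str
    suggested_tags: list = field(default_factory=list)
    approved_tags: list = field(default_factory=list)


def suggest_tags(article, vocab=CONTROLLED_VOCAB):
    """Automated step: propose tags whose trigger phrases occur in the text."""
    text = f"{article.headline} {article.body}".lower()
    suggestions = []
    for tag, phrases in vocab.items():
        # Whole-word match on any trigger phrase for this tag.
        if any(re.search(r"\b" + re.escape(p) + r"\b", text) for p in phrases):
            suggestions.append(tag)
    article.suggested_tags = suggestions
    return suggestions


def review_tags(article, rejected=(), added=()):
    """Human step: a staff member drops wrong suggestions and adds missing tags."""
    rejected_set = set(rejected)
    kept = [t for t in article.suggested_tags if t not in rejected_set]
    article.approved_tags = kept + [t for t in added if t not in kept]
    return article.approved_tags


if __name__ == "__main__":
    story = Article(
        headline="Smog blankets Rhodes",
        body="Citizen sensors report poor air quality across the island.",
    )
    print(suggest_tags(story))
    print(review_tags(story, rejected=["Greece"], added=["Citizen Science"]))
```

The split into two functions mirrors the blended approach in the slides: the algorithm only suggests, and nothing is published as a tag until a person has vetted it. A later expert pass (the Index Department's role) could reuse `review_tags` on the already approved list.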