ICDE2009 Keynotes Summary. Shanghai, China, 3.29-4.2. Li Yukun


ICDE2009 Keynotes Summary

Shanghai, China, 3.29-4.2

Li Yukun

Outline

Keynotes

Search Computing (Stefano Ceri)

Data Management in the Cloud (Raghu Ramakrishnan)

Why Can't I Find My Data the Way I Find My Dinner? (David Carlson)

Keynote 1

Search Computing

Stefano Ceri

Dipartimento di Elettronica e Informazione, Politecnico di Milano

Piazza L. Da Vinci 32, 20133 Milano, Italy


Motivation

“Who are the strongest European competitors on software ideas?

Who is the best doctor to cure insomnia in a nearby hospital?

Where can I attend an interesting conference in my field close to a sunny beach?”

This information is available on the Web, but no software system can accept such queries nor compute the answer.

Core model for search computing

Conventional services: abstracted as systems producing sets of equal-weight answers.

Service computing: a cross-discipline that covers the science and technology of bridging the gap between business services and IT services. The goal of services computing is to enable IT services and computing technology to perform business services more efficiently and effectively.

Search services: can be abstracted as systems producing ranked lists of answers.

Search computing: a new paradigm where ranking is the dominant factor for composing services. A multi-domain query is answered by a constellation of cooperating search services, possibly dynamically selected.
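To make "ranking as the dominant factor for composing services" concrete, here is a minimal sketch of joining the ranked answers of two search services. Everything in it is an illustrative assumption rather than part of the SeCo project: the function name `rank_join`, the sample cities and scores, and the choice of multiplying scores to combine them.

```python
from typing import List, Tuple

Ranked = List[Tuple[str, float]]  # (join key, score), best answer first


def rank_join(left: Ranked, right: Ranked, k: int) -> List[Tuple[str, float]]:
    """Join two ranked lists on their key and return the top-k combinations.

    The combined score is the product of the two scores (an assumption;
    any monotone combining function would serve for this sketch).
    """
    right_scores = {}
    for key, score in right:
        # Keep the best score per key in case a service repeats a key.
        right_scores[key] = max(score, right_scores.get(key, 0.0))

    combined = [
        (key, score * right_scores[key])
        for key, score in left
        if key in right_scores
    ]
    combined.sort(key=lambda pair: pair[1], reverse=True)
    return combined[:k]


# Hypothetical answers for "an interesting conference close to a sunny beach".
conferences = [("Shanghai", 0.9), ("Nice", 0.8), ("Oslo", 0.7)]
beaches = [("Nice", 0.95), ("Shanghai", 0.4), ("Rimini", 0.9)]

top = rank_join(conferences, beaches, k=2)
```

This version materializes both lists before joining. A real multi-domain engine would more likely pull answers from each service incrementally and stop once the top-k result is certain.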

CHAPTERS OF SEARCH COMPUTING

Theory for search computing: select the best abstractions covering the concepts; design basic operations on services and algorithms; compute time and space complexity.

Statistical models for search services: build statistical estimators of the number and quality of the results.

Optimization methods for search computing

Description abstractions for search services: expose ranking-specific properties of search services.

Language abstractions for search computing: incorporate the ranking aspects and strategies for dealing with rankings.

CHAPTERS OF SEARCH COMPUTING

Human-computer interfaces: expressing ranking preferences; light-weight user interaction.

Semantics: merging the results of heterogeneous search services; semantic “join” of search services.

Higher-order ranking: a multi-level “ranking of rankings” is essential for selecting and prioritizing search services.

Managing individual and social searching: search strategies tied to user profiling or to past user interactions; societal recommendation and evaluation.

Thus, individual and societal aspects are key ingredients for search computing.

CHAPTERS OF SEARCH COMPUTING

Search computing engineering: designing, assembling and deploying search computing software applications.

Economy of search computing: suitable business models, based upon advertising schemes, pay-per-query, subscription fees, micro-billing, and so on.

Security and privacy of search computing: control of how data is used. For instance, use of a search service could be granted to a service computing application, provided that the service’s owners can trace all queries involving their data and limit the kind of information that is made visible to the queries.

PROJECT ORGANIZATION

Funded by the European Research Council in the framework of the IDEAS Advanced Grants;

It started on Nov. 1, 2008 and will last five years.

PROJECT ORGANIZATION

The project involves about 30 researchers at Politecnico

Abdan Abid, Edoardo Amaldi, Alessandro Bozzon, Daniele Maria Braga, Marco Brambilla, Tommaso Buganza, Alessandro Campi, Sofia Ceppi, Sara Comai, Emanuele Della Valle, Piero Fraternali, Nicola Gatti, Michael Grossniklaus, Ma’moun Abu Hellu, Pier Luca Lanzi, Davide Martinenghi, Marco Masseroli, Maristella Matera, Davide Mazza, Giuseppe Pozzi, Stefania Ronchi, Roberto Verganti, Marco Tagliasacchi, Massimo Tisi.

SeCo has an advisory board: Edoardo Amaldi (Operations Research), Fabio Casati (Service Computing), Georg Gottlob (Theory), Ioana Manolescu (Systems and Performance), Roberto Verganti (Business Models), Gerhard Weikum (Information Retrieval for the Web), Jennifer Widom (Languages and Paradigms).

The project has seven teams: Concept team; Theory and methods; Service registration and management; Query processing; Interaction design; Tools and prototypes; Business models and technology watch.

More information on SeCo is available on the project’s Web site: http://home.dei.polimi.it/ceri/seco/index.html

Outline

Keynotes

Search Computing (Stefano Ceri)

Data Management in the Cloud (Raghu Ramakrishnan)

Why Can't I Find My Data the Way I Find My Dinner? (David Carlson)

Keynote 2: Data Management in the Cloud

Yahoo! Research: Raghu Ramakrishnan, Brian Cooper, Utkarsh Srivastava, Adam Silberstein, Nick Puz, Rodrigo Fonseca

CCDI: Chuck Neerdaels, P.P.S. Narayan, Kevin Athey, Toby Negrin, plus Dev/QA teams

SCENARIOS

Pie-in-the-sky

Living in the Clouds

We want to start a new website, FredsList.com

Our site will provide listings of items for sale, jobs, etc.

As time goes on, we’ll add more features to illustrate how more cloud capabilities are used as needed.

The list of capabilities/components is illustrative, not exhaustive.

Step 1: Listings

FredsList wants to store listings as (key, category, description):

1234323, transportation, For sale: one bicycle, barely used

5523442, childcare, Nanny available in San Jose

215534, wanted, Looking for issue 1 of Superman comic book

[Diagram: FredsList.com application, on top of Simple Web Service APIs, on top of Database (Sherpa)]

DECLARE DATASET Listings AS ( ID String PRIMARY KEY, Category String, Description Text )

Step 2: Search

FredsList’s customers quickly ask for keyword search.

Example queries: “bicycle”, “dvd’s”, “nanny”

[Diagram: FredsList.com application, on top of Simple Web Service APIs, on top of Database (Sherpa), Search (Vespa), and Messaging (YMB)]

ALTER Listings SET Description SEARCHABLE
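A minimal sketch of what making the Description field searchable could involve, using an in-memory inverted index over the three sample listings. This is an illustration only: the helper names `tokenize`, `build_index`, and `search` are hypothetical, and nothing here reflects how Vespa or Sherpa actually implement keyword search.

```python
import re
from collections import defaultdict

# The three sample listings from Step 1: ID -> (category, description).
listings = {
    "1234323": ("transportation", "For sale: one bicycle, barely used"),
    "5523442": ("childcare", "Nanny available in San Jose"),
    "215534": ("wanted", "Looking for issue 1 of Superman comic book"),
}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(records):
    """Map each description word to the set of listing IDs containing it."""
    index = defaultdict(set)
    for listing_id, (_category, description) in records.items():
        for word in tokenize(description):
            index[word].add(listing_id)
    return index


def search(index, query):
    """Return the IDs of listings that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


index = build_index(listings)
bicycle_hits = search(index, "bicycle")
```

Multi-word queries are treated as a conjunction, so `search(index, "san jose")` matches only listings containing both words.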

Step 3: Photos

FredsList decides to add photos to listings.

[Diagram: FredsList.com application, on top of Simple Web Service APIs, on top of Database (Sherpa), Search (Vespa), Messaging (YMB), and Storage (MObStor)]

Foreign key: photo → listing

ALTER Listings ADD Photo BLOB

Step 4: Data Analysis

FredsList wants to analyze its listings to get statistics about category, do geocoding, etc.

[Diagram: FredsList.com application, on top of Simple Web Service APIs, on top of Database (Sherpa), Search (Vespa), Messaging (YMB), Storage (MObStor), and Compute (Grid)]

Foreign key: photo → listing

Batch export to the grid:

Pig query to analyze categories

Hadoop program to geocode data

Hadoop program to generate fancy pages for listings

ALTER Listings MAKE ANALYZABLE

Step 5: Performance

FredsList wants to reduce its data access latency.

[Diagram: FredsList.com application, on top of Simple Web Service APIs, on top of Database (Sherpa), Search (Vespa), Messaging (YMB), Storage (MObStor), Compute (Grid), and Caching (memcached)]

Foreign key: photo → listing

Batch export

ALTER Listings MAKE CACHEABLE
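One common way to put a cache in front of a record store is the cache-aside pattern, sketched below. It is an assumption that a cacheable dataset would behave like this; plain dictionaries stand in for both the database and memcached, and the class name `CachedStore` is hypothetical.

```python
class CachedStore:
    """Cache-aside reads in front of a slower record store."""

    def __init__(self, backing):
        self.backing = backing   # stands in for the database
        self.cache = {}          # stands in for memcached
        self.backing_reads = 0   # counts reads that missed the cache

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        self.backing_reads += 1
        value = self.backing.get(key)
        if value is not None:
            self.cache[key] = value
        return value

    def put(self, key, value):
        self.backing[key] = value
        # Invalidate rather than update, so the next read reloads the record.
        self.cache.pop(key, None)


store = CachedStore({"1234323": "For sale: one bicycle, barely used"})
first = store.get("1234323")    # misses the cache, reads the backing store
second = store.get("1234323")   # served from the cache
```

Invalidating on write keeps the sketch simple. It trades one extra backing read after each update for never serving a value the writer has already replaced.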

EYES TO THE SKIES

Motherhood-and-Apple-Pie

Requirements for Cloud Services

Multitenant: A cloud service must support multiple, organizationally distant customers.

Elasticity: Tenants should be able to negotiate and receive resources/QoS on demand.

Resource sharing: Ideally, spare cloud resources should be transparently applied when a tenant’s negotiated QoS is insufficient.

Horizontal scaling: It should be possible to add cloud capacity in small increments; this should be transparent to the tenants.

Metering: A cloud service must support accounting that reasonably ascribes operational and capital expenditures to each of the tenants of the service.

Security: A cloud service should be secure in that tenants are not made vulnerable because of loopholes in the cloud.

Availability: A cloud service should be highly available.

Operability: A cloud service should be easy to operate.

Types of Cloud Services

Two kinds of cloud services:

Horizontal cloud services: functionality enabling tenants to build applications or new services on top of the cloud.

Functional cloud services: functionality that is useful in and of itself to tenants, e.g., various SaaS instances such as Salesforce.com; Google Analytics and Yahoo!’s IndexTools; and Yahoo! properties aimed at end users and small businesses, e.g., Flickr, Groups, Mail, News, Shopping.

Yahoo! has been offering these for a long while (e.g., Mail for SMB, Groups, Flickr, BOSS, Ad exchanges).

SHERPA

To Help You Scale Your Mountains of Data

The Sherpa Solution

The next generation global-scale record store

Record-orientation: Routing, data storage optimized for low-latency record access

Scale out: Add machines to scale throughput (while keeping latency low)

Asynchrony: Pub-sub replication to far-flung datacenters to mask propagation delay

Consistency model: Reduce complexity of asynchrony for the application programmer

Cloud deployment model: Hosted, managed service to reduce app time-to-market and enable on demand scale and elasticity

QUERY PROCESSING

Accessing Data

[Diagram: three storage units (SU), with numbered messages]

1. Get key k

2. Get key k

3. Record for key k

4. Record for key k

Bulk Read

[Diagram: scatter/gather server and three storage units (SU), with numbered messages]

1. {k1, k2, … kn}

2. Get k1, Get k2, Get k3

Storage unit 1 Storage unit 2 Storage unit 3
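The bulk read pattern can be sketched as follows: group the requested keys by storage unit, fetch from each unit in parallel, then merge the partial results. The sketch rests on assumptions of its own. Three dictionaries stand in for storage units, keys are placed by a CRC32 hash (the slides do not say how keys map to units), and `unit_for`, `put`, and `bulk_get` are hypothetical names.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Three in-memory dictionaries stand in for the storage units.
storage_units = [dict(), dict(), dict()]


def unit_for(key):
    """Pick a storage unit for a key (hash placement is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % len(storage_units)


def put(key, value):
    storage_units[unit_for(key)][key] = value


def bulk_get(keys):
    """Scatter keys to their storage units, fetch per unit, gather results."""
    by_unit = defaultdict(list)
    for key in keys:
        by_unit[unit_for(key)].append(key)

    def fetch(unit_id, unit_keys):
        unit = storage_units[unit_id]
        return {k: unit[k] for k in unit_keys if k in unit}

    results = {}
    with ThreadPoolExecutor(max_workers=len(storage_units)) as pool:
        futures = [pool.submit(fetch, uid, ks) for uid, ks in by_unit.items()]
        for future in futures:
            results.update(future.result())
    return results


for i in range(1, 6):
    put(f"k{i}", f"v{i}")

found = bulk_get(["k1", "k3", "missing"])
```

Each unit is contacted at most once per bulk request, however many keys it owns. Keys that do not exist are simply absent from the result.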

Range Queries in YDOT

Clustered, ordered retrieval of records

Router (storage units and split points, in key order): Storage unit 1 | Cantaloupe | Storage unit 3 | Lime | Storage unit 2 | Strawberry | Storage unit 1

Tablets (records sorted by key):

Apple, Avocado, Banana, Blueberry

Cantaloupe, Grape, Kiwi, Lemon

Lime, Mango, Orange

Strawberry, Tomato, Watermelon

Query: Grapefruit…Pear? is split into Grapefruit…Lime? and Lime…Pear?
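The splitting of a range query across tablets can be sketched with a sorted list of split points and a binary search. The sketch assumes the router line on the slide alternates owners and split points, so keys below the first split point belong to storage unit 1, the next interval to storage unit 3, and so on. The names `SPLIT_POINTS`, `OWNERS`, `tablet_for`, and `plan_range` are hypothetical.

```python
from bisect import bisect_right

# Assumed reading of the router on the slide:
# keys < "Cantaloupe" -> unit 1, ["Cantaloupe", "Lime") -> unit 3,
# ["Lime", "Strawberry") -> unit 2, keys >= "Strawberry" -> unit 1.
SPLIT_POINTS = ["Cantaloupe", "Lime", "Strawberry"]
OWNERS = [1, 3, 2, 1]


def tablet_for(key):
    """Index of the tablet whose key interval contains `key`."""
    return bisect_right(SPLIT_POINTS, key)


def plan_range(low, high):
    """Split the half-open range [low, high) into per-tablet sub-ranges.

    Returns a list of (storage_unit, sub_low, sub_high) in key order.
    """
    if low >= high:
        return []
    plan = []
    start = low
    tablet = tablet_for(low)
    while True:
        # Upper bound of the current tablet, or None for the last tablet.
        bound = SPLIT_POINTS[tablet] if tablet < len(SPLIT_POINTS) else None
        if bound is None or high <= bound:
            plan.append((OWNERS[tablet], start, high))
            return plan
        plan.append((OWNERS[tablet], start, bound))
        start = bound
        tablet += 1


plan = plan_range("Grapefruit", "Pear")
```

For the query on the slide, `plan` holds two sub-ranges, Grapefruit to Lime and Lime to Pear, matching the split shown. Because records are clustered by key, each sub-range is a sequential scan within one tablet.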

Updates

[Diagram: routers, message brokers, and three storage units (SU), with numbered messages]

1. Write key k

2. Write key k

3. Write key k

4. (unlabeled)

5. SUCCESS

6. Write key k

7. Sequence # for key k

8. Sequence # for key k

ASYNCHRONOUS REPLICATION AND CONSISTENCY

Asynchronous Replication

Goal: make it easier for applications to reason about updates and cope with asynchrony

What happens to a record with primary key “Brian”?

Consistency Model

[Figure: timeline of one record. Events in time order: Record inserted, Update, Update, Update, Update, Update, Update, Update, Delete. Versions in time order: v. 1, v. 2, v. 3, v. 4, v. 5, v. 6, v. 7, v. 8, all within Generation 1. One version is labeled "Current version" and two are labeled "Stale version".]

Operations shown against this timeline, one per slide:

Read

Read up-to-date

Read ≥ v.6

Write

Write if = v.7 (result: ERROR)
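The five operations can be sketched on a single versioned record. The semantics below are assumptions modeled on the per-record timeline consistency described in the PNUTS paper listed under Further Reading: a plain read may return a stale version, an up-to-date read returns the current version, a read with a minimum version accepts any copy at least that new, and a conditional write fails unless the named version is still current. The class and method names are hypothetical, and `replica_version` is only a stand-in for how far a replica has caught up.

```python
class StaleWrite(Exception):
    """Raised when a conditional write names a version that is not current."""


class VersionedRecord:
    def __init__(self, value):
        self.versions = [value]   # versions[i] holds version i + 1

    @property
    def current_version(self):
        return len(self.versions)

    def _at(self, version):
        if not 1 <= version <= self.current_version:
            raise ValueError(f"no such version: {version}")
        return version, self.versions[version - 1]

    def read_any(self, replica_version):
        """Read: return whatever the replica has, possibly a stale version."""
        return self._at(replica_version)

    def read_up_to_date(self):
        """Read up-to-date: return the current version."""
        return self._at(self.current_version)

    def read_at_least(self, required_version, replica_version):
        """Read >= v: use the replica copy only if it is new enough."""
        if required_version > self.current_version:
            raise ValueError("required version does not exist yet")
        if replica_version >= required_version:
            return self._at(replica_version)
        return self.read_up_to_date()

    def write(self, value):
        """Write: append a new version and return its number."""
        self.versions.append(value)
        return self.current_version

    def write_if(self, expected_version, value):
        """Write if = v: succeed only when v is still the current version."""
        if expected_version != self.current_version:
            raise StaleWrite(
                f"expected v.{expected_version}, current is v.{self.current_version}"
            )
        return self.write(value)


record = VersionedRecord("v1 data")
for n in range(2, 9):
    record.write(f"v{n} data")   # the record is now at v. 8
```

With the record at v. 8, `record.write_if(7, ...)` raises `StaleWrite`, which corresponds to the ERROR outcome on the last slide of the sequence.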

Index Maintenance

How to have lots of interesting indexes, without killing performance?

Solution: Asynchrony! Indexes are updated asynchronously when the base table is updated.

Planned functionality
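A minimal sketch of asynchronous index maintenance: a write updates the base table immediately and only enqueues the index change, which a background step applies later. The sketch is an assumption about the general technique, not Sherpa's planned design. The class name `Table`, the category index, and the explicit `apply_pending` call standing in for a background indexer are all hypothetical.

```python
from collections import defaultdict, deque


class Table:
    def __init__(self):
        self.rows = {}                        # primary key -> category
        self.by_category = defaultdict(set)   # secondary index
        self.pending = deque()                # index work not yet applied

    def write(self, key, category):
        """Update the base table now; only enqueue the index change."""
        old = self.rows.get(key)
        self.rows[key] = category
        self.pending.append((key, old, category))

    def apply_pending(self):
        """Drain the queue, as a background indexer would."""
        while self.pending:
            key, old, new = self.pending.popleft()
            if old is not None:
                self.by_category[old].discard(key)
            self.by_category[new].add(key)

    def lookup(self, category):
        """Index lookup; may lag the base table until the queue drains."""
        return set(self.by_category.get(category, set()))


table = Table()
table.write("1234323", "transportation")
before = table.lookup("transportation")   # index has not caught up yet
table.apply_pending()
after = table.lookup("transportation")
```

The write path stays as cheap as a single base-table update. The cost is that index lookups can briefly return stale results, as `before` shows.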

SHERPA

IN CONTEXT

MObStor

Yahoo!’s next-generation globally replicated, virtualized media object storage service

Better provisioning, easy migration, replication, better BCP, and performance

New features (Evergreen URLs, CDN integration, REST API, …)

The object metadata problem is addressed using Sherpa, though MObStor is focused on blob storage.

Storage & Delivery Stack

The World Has Changed

Web applications need: scalability, geographic distribution, high availability, reliable storage.

Web applications are a poor fit for: complicated queries, strong transactions.

Web Data Management

Large data analysis (Hadoop)

• Scan-oriented workloads

• Focus on sequential disk I/O

• $ per CPU cycle

Structured record storage (PNUTS)

• CRUD

• Point lookups and short scans

• Index-organized table and random I/Os

• $ per latency

Blob storage (SAN/NAS)

• Object retrieval and streaming

• Scalable file storage

• $ per GB

Application Design Space

[Figure: systems placed on two axes, Records vs. Files and Get a few things vs. Scan everything. Systems shown: Sherpa, MObStor, Everest, Hadoop, YMDB, MySQL, Filer, Oracle, BigTable.]

Further Reading

Efficient Bulk Insertion into a Distributed Ordered Table (SIGMOD 2008). Adam Silberstein, Brian Cooper, Utkarsh Srivastava, Erik Vee, Ramana Yerneni, Raghu Ramakrishnan

PNUTS: Yahoo!'s Hosted Data Serving Platform (VLDB 2008). Brian Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Phil Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver, Ramana Yerneni

Outline

Keynotes

Search Computing (Stefano Ceri)

Data Management in the Cloud (Raghu Ramakrishnan)

Why Can't I Find My Data the Way I Find My Dinner? (David Carlson)

Keynote 3

Why Can’t I Find My Data the Way I Find My Dinner?

David Carlson

Director, International Polar Year International Programme Office

Cambridge, UK

International Polar Year (IPY)

One can find almost every discipline represented in the IPY projects, and funding has come from geophysical, biological and social agencies and programs.

IPY data

Open access data policy.

Display and access of IPY data: we have component systems, within nations, disciplines, or existing data service centers, that provide access examples for portions of the IPY data set.

We have unprecedented bandwidth for real-time data transmission.

But how can these data sets be accessed easily?

Enormous challenges: financial, social, and technical barriers.

This talk focuses on the latter.

Example

To understand and predict the health of migratory bird populations in the polar environment, we need ornithological, toxicological, ecological, meteorological, hydrological, climatological, geomagnetic, and sociological data.

These data will cover a broad range of space and time scales, often in disparate (or at least inconsistent) space and time coordinate systems.

Problems

Data access: For a larger population of curious users, the specialized data services associated with subsets of the IPY data will not be easy, friendly, or even accessible.

Interfaces: No familiar interfaces will provide integrated discovery and browse services.

No long-term plan: On longer time scales, and even as data storage capabilities grow rapidly, most of the IPY data sets do not, at present, have acceptable long-term archive plans, even for passive storage without continued discovery services.

Research issues

Smart search engines, pattern recognition and data mining tools, multi-gigabyte personal storage devices, and advanced animation capabilities, coupled with almost unlimited mobile bandwidth, offer many citizens expansive and amazing access to commercial, recreational, financial, and personal data and data services.

What changes in strategy, technology, funding and individual and collective behavior need to occur in the world of scientific data to allow me to browse, view and access IPY data on my iTouch?

Thanks