Making Sense at Scale with Algorithms, Machines & People

Michael Franklin
EECS, Computer Science
UC Berkeley

Emory University, December 7, 2011
Defining the Big Data Problem

Size + Complexity = answers that don't meet quality, time, and cost requirements.
The State of the Art

[Diagram: Algorithms, Machines, and People shown as separate capabilities, with examples such as web search and IBM's Watson]
Needed: A Holistic Approach

[Diagram: the same elements, Algorithms, Machines, and People, now integrated into one system, again with search and IBM's Watson as examples]
AMP Team
• 8 (primary) faculty at Berkeley
  • Databases, Machine Learning, Networking, Security, Systems, …
• 4 partner applications
  • Participatory Sensing: Mobile Millennium (Alex Bayen, Civil Engineering)
  • Collective Discovery: Opinion Space (Ken Goldberg, IEOR)
  • Urban Planning and Simulation: UrbanSim (Paul Waddell, Environmental Design)
  • Cancer Genomics / Personalized Medicine (Taylor Sittler, UCSF)
Big Data Opportunity
• The Cancer Genome Atlas (TCGA)
  – 20 cancer types x 500 patients each x (1 tumor genome + 1 normal genome) = 5 petabytes, i.e., roughly 250 GB per genome
  – David Haussler (UCSC): datacenter online 12/11?
  – Intel to donate an AMP Lab cluster and place it next to the TCGA data
Slide from David Haussler, UCSC, “Cancer Genomics,” AMP retreat, 5/24/11
Berkeley Systems Lab Model

Industrial collaboration: the "Two Feet In" model
Berkeley Data Analytics System (BDAS)

[Stack diagram: storage, resource management, and higher query languages / processing frameworks at the bottom; analytics libraries and data integration, a data source selector, a result control center, visualization, and a crowd interface above; quality control and monitoring/debugging cut across all layers. User roles shown: infrastructure builder, algorithms/tools developer, data collector, and data analyst.]

A top-to-bottom rethinking of the Big Data analytics stack, integrating Algorithms, Machines, and People.
Algorithms: More Data = Better Answers

Given an inferential goal and a fixed computational budget, provide a guarantee (supported by an algorithm and an analysis) that the quality of inference will increase monotonically as data accrue (without bound).

Error bars on every answer!

[Plot: the estimate converges to the true answer, with shrinking error bars, as the number of data points grows]
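One hedged way to state that goal formally (my formalization, not the slide's): for the estimate \hat{\theta}_n computed from n data points and the true answer \theta, report an error bar \epsilon(n) with

  P\bigl(|\hat{\theta}_n - \theta| \le \epsilon(n)\bigr) \ge 1 - \delta, \qquad \epsilon(n+1) \le \epsilon(n) \text{ for all } n,

so the reported uncertainty never grows as data accrue.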
Towards ML/Systems Co-Design
• Some ingredients of a system that can estimate and manage statistical risk:
  – Distributed bootstrap (bag of little bootstraps, BLB); a minimal sketch follows below
  – Subsampling (stratified)
  – Active sampling (cf. crowdsourcing)
  – Bias estimation (especially with crowd-sourced data)
  – Distributed optimization
  – Streaming versions of classical ML algorithms
  – Streaming distributed bootstrap
• All of these must be scalable and robust.
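As a concrete illustration of the first ingredient, here is a minimal single-machine sketch of the bag of little bootstraps. The function name and parameter choices are mine; a real deployment would spread the subsets across a cluster and run the resampling in parallel.

import numpy as np

def blb_stderr(data, estimator=np.mean, n_subsets=10, n_boot=50, gamma=0.7, seed=0):
    """Toy Bag of Little Bootstraps: estimate the standard error of
    `estimator` over `data` by bootstrapping small subsamples."""
    rng = np.random.default_rng(seed)
    n = len(data)
    b = int(n ** gamma)                      # size of each little subsample
    subset_errs = []
    for _ in range(n_subsets):
        subset = rng.choice(data, size=b, replace=False)
        estimates = []
        for _ in range(n_boot):
            # Resample n points from the subset via multinomial counts,
            # so each replicate behaves like a full-size bootstrap sample.
            counts = rng.multinomial(n, np.full(b, 1.0 / b))
            estimates.append(estimator(np.repeat(subset, counts)))
        subset_errs.append(np.std(estimates))
    return float(np.mean(subset_errs))       # average quality across subsets

if __name__ == "__main__":
    x = np.random.default_rng(1).exponential(size=100_000)
    print("BLB std. error of the mean:", blb_stderr(x))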
Machines Agenda
• New software stack to:
  • effectively manage cluster resources
  • effectively extract value out of big data
• Projects:
  • "Datacenter OS": extend the Mesos distributed resource manager
  • Common runtime: structured, unstructured, streaming, sampling, …
  • New processing frameworks and storage systems, e.g., Spark, a parallel environment for iterative algorithms (a sketch follows below)
  • Example: the QuickSilver query processor, which lets users navigate the tradeoff space (quality, time, and cost) for complex queries
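For flavor, here is a minimal sketch of the kind of iterative computation Spark targets: logistic regression by gradient descent over a dataset cached in memory. PySpark is used purely for illustration (the 2011 system exposed a Scala API), and the input path, step size, and iteration count are placeholders.

import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="iterative-sketch")

# Each input line: a +/-1 label followed by feature values, e.g. "1 0.2 3.4 ..."
points = (sc.textFile("hdfs:///data/points.txt")
            .map(lambda line: np.array(line.split(), dtype=float))
            .cache())                      # keep the working set in memory across iterations

w = np.zeros(points.first().size - 1)      # initial weight vector
for i in range(10):                        # gradient-descent iterations
    grad = points.map(
        lambda p: (1.0 / (1.0 + np.exp(-p[0] * w.dot(p[1:]))) - 1.0) * p[0] * p[1:]
    ).reduce(lambda a, b: a + b)
    w -= 0.1 * grad                        # fixed step size for the sketch

print("weights:", w)
sc.stop()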
QuickSilver: Where Do We Want to Go?

Today: "simple" queries on PBs of data take hours.
Ideal: sub-second arbitrary queries on PBs of data.
Goal: compute complex queries on PBs of data in < x seconds with < y% error.
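The goal statement invites an online-aggregation style illustration. The toy function below (mine, not QuickSilver) keeps enlarging a random sample until the estimated error of an aggregate drops below a target or a time budget runs out, which is one way to trade quality against time and cost.

import time
import numpy as np

def approx_mean(data, max_seconds=1.0, target_rel_err=0.01, batch=10_000, seed=0):
    """Toy quality/time tradeoff: sample until the estimated relative error
    of the mean is below target_rel_err or the time budget is spent."""
    rng = np.random.default_rng(seed)
    start = time.time()
    sample = np.empty(0)
    while time.time() - start < max_seconds:
        sample = np.concatenate([sample, rng.choice(data, size=batch)])
        est = sample.mean()
        stderr = sample.std(ddof=1) / np.sqrt(len(sample))
        rel_err = 1.96 * stderr / abs(est)     # approximate 95% confidence half-width
        if rel_err < target_rel_err:
            break
    return est, rel_err, len(sample)

if __name__ == "__main__":
    data = np.random.default_rng(1).lognormal(size=2_000_000)
    est, err, n = approx_mean(data)
    print(f"mean ≈ {est:.4f} ± {100 * err:.2f}% using {n} rows")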
People
• Make people an integrated part of the system!
• Leverage human activity
• Leverage human intelligence (crowdsourcing)

Use the crowd to:
• Find missing data
• Integrate data
• Make subjective comparisons
• Recognize patterns
• Solve problems
[Diagram: people supply data and activity to Machines + Algorithms; questions flow out to people and answers flow back]
Human-Tolerant Computing
People throughout the analytics lifecycle?
• Inconsistent answer quality
• Incentives
• Latency and variance
• Open vs. closed world
• Hybrid human/machine design

Approaches:
• Statistical methods for error and bias
• Quality-conscious interface design
• Cost (time, quality)-based optimization
CROWDSOURCING EXAMPLES
Citizen Science
NASA "Clickworkers," circa 2000

Citizen Journalism / Participatory Sensing
Expert Advice
Data collection
• Freebase
One View of Crowdsourcing
From Quinn & Bederson, “Human Computation: A Survey and Taxonomy of a Growing Field”, CHI 2011.
Industry View
Participatory Culture - Explicit

Participatory Culture - Implicit

John Murrell, Good Morning Silicon Valley, 9/17/09: "…every time we use a Google app or service, we are working on behalf of the search sovereign, creating more content for it to index and monetize or teaching it something potentially useful about our desires, intentions and behavior."
Types of Tasks

Task Granularity    Examples
Complex Tasks       Build a website; develop a software system; overthrow a government?
Simple Projects     Design a logo and visual identity; write a term paper
Macro Tasks         Write a restaurant review; test a new website feature; identify a galaxy
Micro Tasks         Label an image; verify an address; simple entity resolution

Inspired by the report "Paid Crowdsourcing," Smartsheet.com, 9/15/2009
Amazon Mechanical Turk (AMT)
A Programmable Interface

• Amazon Mechanical Turk API
• Requesters place Human Intelligence Tasks (HITs) via the "createHit()" API call
  • Parameters include: number of replicas, expiration, user interface, …
• Requesters approve jobs and payment via "getAssignments()" and "approveAssignments()"
• Workers (a.k.a. "turkers") choose jobs, do them, and get paid

A hedged sketch of this requester workflow appears below.
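The sketch uses the present-day boto3 MTurk client rather than the 2011-era createHit()/getAssignments() calls named on the slide; the HIT title, reward, and question form are placeholders only (while testing you would point endpoint_url at the requester sandbox).

import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

QUESTION_XML = """<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
  <HTMLContent><![CDATA[<html><body>
    <p>What is the headquarters address of the named company?</p>
    <input name="hq_address"/>
  </body></html>]]></HTMLContent>
  <FrameHeight>300</FrameHeight>
</HTMLQuestion>"""

# Requester places a HIT with replication, expiration, and a user interface.
hit = mturk.create_hit(
    Title="Fill in a missing company address",
    Description="Provide the headquarters address for the named company",
    Reward="0.01",                   # dollars, passed as a string
    MaxAssignments=3,                # number of replicas
    LifetimeInSeconds=24 * 3600,     # expiration
    AssignmentDurationInSeconds=600,
    Question=QUESTION_XML,
)

# Later: fetch submitted work and approve it, which triggers payment.
hit_id = hit["HIT"]["HITId"]
for a in mturk.list_assignments_for_hit(HITId=hit_id)["Assignments"]:
    mturk.approve_assignment(AssignmentId=a["AssignmentId"])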
Worker’s View
Requester's View
CrowdDB: A Radical New Idea?
"The hope is that, in not too many years, human brains and computing machines will be coupled together very tightly, and that the resulting partnership will think as no human brain has ever thought and process data in a way not approached by the information-handling machines we know today."
(J. C. R. Licklider, "Man-Computer Symbiosis," 1960)
Problem: DB-hard Queries

SELECT Market_Cap
FROM Companies
WHERE Company_Name = "IBM"

Number of Rows: 0
Problem: Entity Resolution

Company_Name              Address                    Market Cap
Google                    Googleplex, Mtn. View CA   $170Bn
Intl. Business Machines   Armonk, NY                 $203Bn
Microsoft                 Redmond, WA                $206Bn
DB-hard Queries

SELECT Market_Cap
FROM Companies
WHERE Company_Name = "Apple"

Number of Rows: 0

Problem: Closed World Assumption

Company_Name              Address                    Market Cap
Google                    Googleplex, Mtn. View CA   $170Bn
Intl. Business Machines   Armonk, NY                 $203Bn
Microsoft                 Redmond, WA                $206Bn
DB-hard Queries

SELECT Top_1 Image
FROM Pictures
WHERE Topic = "Business Success"
ORDER BY Relevance

Number of Rows: 0

Problem: Subjective Comparison
CrowdDB

Use the crowd to answer DB-hard queries.

Where to use the crowd:
• Find missing data
• Make subjective comparisons
• Recognize patterns

But not:
• Anything the computer already does well

M. Franklin et al., "CrowdDB: Answering Queries with Crowdsourcing," SIGMOD 2011
CrowdSQL

DDL Extensions: crowdsourced columns and crowdsourced tables

CREATE TABLE company (
  name STRING PRIMARY KEY,
  hq_address CROWD STRING);

CREATE CROWD TABLE department (
  university STRING,
  department STRING,
  phone_no STRING)
  PRIMARY KEY (university, department);

DML Extensions: CrowdEqual and CROWDORDER operators (currently UDFs)

CrowdEqual:
SELECT * FROM companies WHERE Name ~= "Big Blue";

CROWDORDER:
SELECT p FROM picture
WHERE subject = "Golden Gate Bridge"
ORDER BY CROWDORDER(p, "Which pic shows better %subject");
User Interface Generation

• A clear UI is key to response time and answer quality.
• The SQL schema can be leveraged to auto-generate the UI (cf. Oracle Forms); a toy sketch of the idea follows below.
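As a toy illustration of schema-driven UI generation (my sketch, not CrowdDB's generator): given a table name, its columns, and the values already known, emit a simple HTML form that asks only for the missing, crowdsourced fields.

def form_for(table, columns, known=None):
    """Build a minimal HTML form for one crowdsourced row from a schema."""
    known = known or {}
    rows = []
    for col in columns:
        if col in known:   # show known values read-only, ask only for the gaps
            rows.append(f"<p>{col}: {known[col]}</p>")
        else:
            rows.append(f'<p>{col}: <input name="{col}"/></p>')
    return (f"<form id='{table}'>\n" + "\n".join(rows) +
            "\n<input type='submit'/></form>")

print(form_for("company", ["name", "hq_address"], known={"name": "IBM"}))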
Subjective Comparisons

MTFunction
• Implements the CROWDEQUAL and CROWDORDER comparisons
• Takes a description and a type (equal, order) parameter
• Quality control is again based on majority vote (a toy aggregation sketch follows below)
• Ordering can be further optimized (e.g., three-way vs. two-way comparisons)
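To make the majority-vote step concrete, here is a toy aggregation (my sketch, not CrowdDB's operator) that turns redundant pairwise comparisons into a single ordering: keep each pair's majority winner, then rank items by how many pairs they win.

from collections import Counter

def aggregate_order(votes):
    """votes maps an item pair to the list of items chosen by each worker."""
    majority = {}
    for (a, b), ballots in votes.items():
        winner, _ = Counter(ballots).most_common(1)[0]   # majority vote per pair
        majority[(a, b)] = winner
    wins = Counter(majority.values())
    items = {x for pair in votes for x in pair}
    return sorted(items, key=lambda x: wins[x], reverse=True)

# Three workers compare each pair of pictures; p2 wins most pairs.
votes = {
    ("p1", "p2"): ["p2", "p2", "p1"],
    ("p1", "p3"): ["p1", "p1", "p3"],
    ("p2", "p3"): ["p2", "p2", "p2"],
}
print(aggregate_order(votes))   # ['p2', 'p1', 'p3']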
Does it Work? Picture Ordering

Query:
SELECT p FROM picture
WHERE subject = "Golden Gate Bridge"
ORDER BY CROWDORDER(p, "Which pic shows better %subject");

Data size: 30 subject areas, with 8 pictures each
Batching: 4 orderings per HIT
Replication: 3 assignments per HIT
Price: 1 cent per HIT

[Results compare turker votes and the aggregated turker ranking against an expert ranking]
User Interface vs. Quality

To get information about Professors and their Departments, three auto-generated interfaces were compared:
[Figure: Department-first form (≈20% error rate), Professor-first form (≈80% error rate), and a de-normalized probe (≈10% error rate)]
Can we build a "Crowd Optimizer"?

SELECT *
FROM Restaurant
WHERE city = …
Price vs. Response Time

[Plot: percentage of HITs with at least one assignment completed vs. time in minutes (0 to 60), for prices of $0.01, $0.02, $0.03, and $0.04 per HIT; 5 assignments, 100 HITs]
Turker Affinity and Errors

[Chart: errors plotted by turker rank]

[Franklin, Kossmann, Kraska, Ramesh, Xin, "CrowdDB: Answering Queries with Crowdsourcing," SIGMOD 2011]
Can we build a "Crowd Optimizer"?

SELECT *
FROM Restaurant
WHERE city = …

Worker forum comments about a requester:
"I would do work for this requester again."
"This guy should be shunned."
"I advise not clicking on his 'information about restaurants' hits."
"Hmm... I smell lab rat material."
"be very wary of doing any work for this requester…"
Processor Relations?

Tim Klas Kraska

HIT Group » "I recently did 299 HITs for this requester. … Of the 299 HITs I completed, 11 of them were rejected without any reason being given. Prior to this I only had 14 rejections, a .2% rejection rate. I currently have 8522 submitted HITs, with a .3% rejection rate after the rejections from this requester (25 total rejections). I have attempted to contact the requester and will update if I receive a response. Until then be very wary of doing any work for this requester, as it appears that they are rejecting about 1 in every 27 HITs being submitted." posted by …

fair: 2/5   fast: 4/5   pay: 2/5   comm: 0/5
Open World = No Semantics?

SELECT * FROM Crowd_Sourced_Table

• What does the above query return?
• In the old world, it was just a table scan.
• In the crowdsourced world:
  • Which answers are "right"?
  • When do we stop?
• BioStatistics to the Rescue????
Open World = No Semantics?

SELECT * FROM Crowd_Sourced_Table

Species Acquisition Curve for Data: borrow species-estimation techniques from biostatistics to judge how complete the crowdsourced answer set is (a toy estimator sketch follows below).
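A toy version of the species-estimation idea alluded to here, using the classic Chao1 estimator to guess how many distinct values the crowd has yet to report (my sketch, not the AMP Lab estimator):

from collections import Counter

def chao1(answers):
    """Estimate the total number of distinct values behind a crowdsourced
    enumeration from how often each observed value was reported."""
    freq = Counter(answers)
    observed = len(freq)
    f1 = sum(1 for c in freq.values() if c == 1)   # values seen exactly once
    f2 = sum(1 for c in freq.values() if c == 2)   # values seen exactly twice
    if f2 > 0:
        return observed + (f1 * f1) / (2 * f2)
    return observed + f1 * (f1 - 1) / 2            # bias-corrected form when f2 = 0

# Workers enumerate ice cream flavors; many singletons suggest many unseen flavors.
answers = ["vanilla", "chocolate", "vanilla", "mint", "pistachio",
           "chocolate", "strawberry", "mocha"]
print(chao1(answers))   # 6 observed, estimated ~10 distinct flavors in total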
Why the Crowd is Different
• Classical approaches don't quite work
• Incorrect answers
  – Chicago is not a state
  – How do you spell Mississippi?
• Streakers vs. samplers
  – Individuals sample without replacement
  – and there is worker/task affinity
• List walking
  – e.g., Googling "ice cream flavors"
• The above can be detected and mitigated to some extent.
How Can You Trust the Crowd?

• General techniques
  – Approval rate / demographic restrictions
  – Qualification tests
  – Gold sets / honey pots (a toy gold-set check is sketched below)
  – Redundancy
  – Verification / review
  – Justification / automatic verification
• Query-specific techniques
• Worker relationship management
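As a small illustration of the gold-set technique, here is a toy filter (function name, data shapes, and threshold are mine) that scores each worker on tasks with known answers and keeps only sufficiently accurate workers:

def filter_by_gold(worker_answers, gold, min_accuracy=0.8):
    """Keep workers whose accuracy on gold-set tasks meets the threshold."""
    trusted = {}
    for worker, answers in worker_answers.items():
        graded = [(t, a) for t, a in answers.items() if t in gold]
        if not graded:
            continue                                 # worker saw no gold tasks
        accuracy = sum(a == gold[t] for t, a in graded) / len(graded)
        if accuracy >= min_accuracy:
            trusted[worker] = answers
    return trusted

gold = {"task1": "IBM", "task2": "Armonk, NY"}
workers = {
    "w1": {"task1": "IBM", "task2": "Armonk, NY", "task9": "Redmond, WA"},
    "w2": {"task1": "Big Blue", "task2": "New York", "task9": "Seattle"},
}
print(filter_by_gold(workers, gold))   # keeps w1, drops w2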
Making Sense at Scale
• Data size is only part of the challenge
• Balance quality, cost, and time for a given problem
• To address this, we must holistically integrate Algorithms, Machines, and People
amplab.cs.berkeley.edu