Making Sense at Scale with Algorithms, Machines & People

Michael Franklin
EECS, Computer Science
UC Berkeley

Emory University, December 7, 2011
Defining the Big Data Problem

Size + Complexity = answers that don't meet quality, time, and cost requirements.
The State of the Art

[Diagram: Algorithms, Machines, and People shown as separate capabilities, with examples such as web search and IBM's Watson]
Needed: A Holistic Approach

[Diagram: the same elements, Algorithms, Machines, and People, now integrated into one system, again with search and IBM's Watson as examples]
AMP Team
• 8 (primary) faculty at Berkeley
  • Databases, Machine Learning, Networking, Security, Systems, …
• 4 partner applications
  • Participatory Sensing: Mobile Millennium (Alex Bayen, Civil Engineering)
  • Collective Discovery: Opinion Space (Ken Goldberg, IEOR)
  • Urban Planning and Simulation: UrbanSim (Paul Waddell, Environmental Design)
  • Cancer Genomics / Personalized Medicine (Taylor Sittler, UCSF)
Big Data Opportunity
• The Cancer Genome Atlas (TCGA)
  – 20 cancer types x 500 patients each x (1 tumor genome + 1 normal genome) = 5 petabytes, i.e., roughly 250 GB per genome
  – David Haussler (UCSC): datacenter online 12/11?
  – Intel to donate an AMP Lab cluster and place it next to the TCGA data
Slide from David Haussler, UCSC, “Cancer Genomics,” AMP retreat, 5/24/11
Berkeley Systems Lab Model

Industrial collaboration: the "Two Feet In" model
Berkeley Data Analytics System (BDAS)

[Stack diagram: storage, resource management, and higher query languages / processing frameworks at the bottom; analytics libraries and data integration, a data source selector, a result control center, visualization, and a crowd interface above; quality control and monitoring/debugging cut across all layers. User roles shown: infrastructure builder, algorithms/tools developer, data collector, and data analyst.]

A top-to-bottom rethinking of the Big Data analytics stack, integrating Algorithms, Machines, and People.
Algorithms: More Data = Better Answers

Given an inferential goal and a fixed computational budget, provide a guarantee (supported by an algorithm and an analysis) that the quality of inference will increase monotonically as data accrue (without bound).

Error bars on every answer!

[Plot: the estimate converges to the true answer, with shrinking error bars, as the number of data points grows]
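One hedged way to state that goal formally (my formalization, not the slide's): for the estimate \hat{\theta}_n computed from n data points and the true answer \theta, report an error bar \epsilon(n) with

  P\bigl(|\hat{\theta}_n - \theta| \le \epsilon(n)\bigr) \ge 1 - \delta, \qquad \epsilon(n+1) \le \epsilon(n) \text{ for all } n,

so the reported uncertainty never grows as data accrue.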
Towards ML/Systems Co-Design
• Some ingredients of a system that can estimate and manage statistical risk:
  – Distributed bootstrap (bag of little bootstraps, BLB); a minimal sketch follows below
  – Subsampling (stratified)
  – Active sampling (cf. crowdsourcing)
  – Bias estimation (especially with crowd-sourced data)
  – Distributed optimization
  – Streaming versions of classical ML algorithms
  – Streaming distributed bootstrap
• All of these must be scalable and robust.
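As a concrete illustration of the first ingredient, here is a minimal single-machine sketch of the bag of little bootstraps. The function name and parameter choices are mine; a real deployment would spread the subsets across a cluster and run the resampling in parallel.

import numpy as np

def blb_stderr(data, estimator=np.mean, n_subsets=10, n_boot=50, gamma=0.7, seed=0):
    """Toy Bag of Little Bootstraps: estimate the standard error of
    `estimator` over `data` by bootstrapping small subsamples."""
    rng = np.random.default_rng(seed)
    n = len(data)
    b = int(n ** gamma)                      # size of each little subsample
    subset_errs = []
    for _ in range(n_subsets):
        subset = rng.choice(data, size=b, replace=False)
        estimates = []
        for _ in range(n_boot):
            # Resample n points from the subset via multinomial counts,
            # so each replicate behaves like a full-size bootstrap sample.
            counts = rng.multinomial(n, np.full(b, 1.0 / b))
            estimates.append(estimator(np.repeat(subset, counts)))
        subset_errs.append(np.std(estimates))
    return float(np.mean(subset_errs))       # average quality across subsets

if __name__ == "__main__":
    x = np.random.default_rng(1).exponential(size=100_000)
    print("BLB std. error of the mean:", blb_stderr(x))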
Machines Agenda
• New software stack to:
  • effectively manage cluster resources
  • effectively extract value out of big data
• Projects:
  • "Datacenter OS": extend the Mesos distributed resource manager
  • Common runtime: structured, unstructured, streaming, sampling, …
  • New processing frameworks and storage systems, e.g., Spark, a parallel environment for iterative algorithms (a sketch follows below)
  • Example: the QuickSilver query processor, which lets users navigate the tradeoff space (quality, time, and cost) for complex queries
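For flavor, here is a minimal sketch of the kind of iterative computation Spark targets: logistic regression by gradient descent over a dataset cached in memory. PySpark is used purely for illustration (the 2011 system exposed a Scala API), and the input path, step size, and iteration count are placeholders.

import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="iterative-sketch")

# Each input line: a +/-1 label followed by feature values, e.g. "1 0.2 3.4 ..."
points = (sc.textFile("hdfs:///data/points.txt")
            .map(lambda line: np.array(line.split(), dtype=float))
            .cache())                      # keep the working set in memory across iterations

w = np.zeros(points.first().size - 1)      # initial weight vector
for i in range(10):                        # gradient-descent iterations
    grad = points.map(
        lambda p: (1.0 / (1.0 + np.exp(-p[0] * w.dot(p[1:]))) - 1.0) * p[0] * p[1:]
    ).reduce(lambda a, b: a + b)
    w -= 0.1 * grad                        # fixed step size for the sketch

print("weights:", w)
sc.stop()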
QuickSilver: Where Do We Want to Go?

Today: "simple" queries on PBs of data take hours.
Ideal: sub-second arbitrary queries on PBs of data.
Goal: compute complex queries on PBs of data in < x seconds with < y% error.
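The goal statement invites an online-aggregation style illustration. The toy function below (mine, not QuickSilver) keeps enlarging a random sample until the estimated error of an aggregate drops below a target or a time budget runs out, which is one way to trade quality against time and cost.

import time
import numpy as np

def approx_mean(data, max_seconds=1.0, target_rel_err=0.01, batch=10_000, seed=0):
    """Toy quality/time tradeoff: sample until the estimated relative error
    of the mean is below target_rel_err or the time budget is spent."""
    rng = np.random.default_rng(seed)
    start = time.time()
    sample = np.empty(0)
    while time.time() - start < max_seconds:
        sample = np.concatenate([sample, rng.choice(data, size=batch)])
        est = sample.mean()
        stderr = sample.std(ddof=1) / np.sqrt(len(sample))
        rel_err = 1.96 * stderr / abs(est)     # approximate 95% confidence half-width
        if rel_err < target_rel_err:
            break
    return est, rel_err, len(sample)

if __name__ == "__main__":
    data = np.random.default_rng(1).lognormal(size=2_000_000)
    est, err, n = approx_mean(data)
    print(f"mean ≈ {est:.4f} ± {100 * err:.2f}% using {n} rows")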
People
• Make people an integrated part of the system!
• Leverage human activity
• Leverage human intelligence (crowdsourcing)

Use the crowd to:
• Find missing data
• Integrate data
• Make subjective comparisons
• Recognize patterns
• Solve problems
[Diagram: people supply data and activity to Machines + Algorithms; questions flow out to people and answers flow back]
Human-Tolerant Computing
People throughout the analytics lifecycle?
• Inconsistent answer quality
• Incentives
• Latency and variance
• Open vs. closed world
• Hybrid human/machine design

Approaches:
• Statistical methods for error and bias
• Quality-conscious interface design
• Cost (time, quality)-based optimization
CROWDSOURCING EXAMPLES
Citizen Science
NASA "Clickworkers," circa 2000

Citizen Journalism / Participatory Sensing
Expert Advice
Data collection
• Freebase
One View of Crowdsourcing
From Quinn & Bederson, “Human Computation: A Survey and Taxonomy of a Growing Field”, CHI 2011.
Industry View
Participatory Culture - Explicit

Participatory Culture - Implicit

John Murrell, Good Morning Silicon Valley, 9/17/09: "…every time we use a Google app or service, we are working on behalf of the search sovereign, creating more content for it to index and monetize or teaching it something potentially useful about our desires, intentions and behavior."
Types of Tasks

Task Granularity    Examples
Complex Tasks       Build a website; develop a software system; overthrow a government?
Simple Projects     Design a logo and visual identity; write a term paper
Macro Tasks         Write a restaurant review; test a new website feature; identify a galaxy
Micro Tasks         Label an image; verify an address; simple entity resolution

Inspired by the report "Paid Crowdsourcing," Smartsheet.com, 9/15/2009
Amazon Mechanical Turk (AMT)
A Programmable Interface

• Amazon Mechanical Turk API
• Requesters place Human Intelligence Tasks (HITs) via the "createHit()" API call
  • Parameters include: number of replicas, expiration, user interface, …
• Requesters approve jobs and payment via "getAssignments()" and "approveAssignments()"
• Workers (a.k.a. "turkers") choose jobs, do them, and get paid

A hedged sketch of this requester workflow appears below.
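The sketch uses the present-day boto3 MTurk client rather than the 2011-era createHit()/getAssignments() calls named on the slide; the HIT title, reward, and question form are placeholders only (while testing you would point endpoint_url at the requester sandbox).

import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

QUESTION_XML = """<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
  <HTMLContent><![CDATA[<html><body>
    <p>What is the headquarters address of the named company?</p>
    <input name="hq_address"/>
  </body></html>]]></HTMLContent>
  <FrameHeight>300</FrameHeight>
</HTMLQuestion>"""

# Requester places a HIT with replication, expiration, and a user interface.
hit = mturk.create_hit(
    Title="Fill in a missing company address",
    Description="Provide the headquarters address for the named company",
    Reward="0.01",                   # dollars, passed as a string
    MaxAssignments=3,                # number of replicas
    LifetimeInSeconds=24 * 3600,     # expiration
    AssignmentDurationInSeconds=600,
    Question=QUESTION_XML,
)

# Later: fetch submitted work and approve it, which triggers payment.
hit_id = hit["HIT"]["HITId"]
for a in mturk.list_assignments_for_hit(HITId=hit_id)["Assignments"]:
    mturk.approve_assignment(AssignmentId=a["AssignmentId"])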
Worker’s View
Requester's View
CrowdDB: A Radical New Idea?
"The hope is that, in not too many years, human brains and computing machines will be coupled together very tightly, and that the resulting partnership will think as no human brain has ever thought and process data in a way not approached by the information-handling machines we know today."
(J. C. R. Licklider, "Man-Computer Symbiosis," 1960)
Problem: DB-hard Queries

SELECT Market_Cap
FROM Companies
WHERE Company_Name = "IBM"

Number of Rows: 0
Problem: Entity Resolution

Company_Name              Address                    Market Cap
Google                    Googleplex, Mtn. View CA   $170Bn
Intl. Business Machines   Armonk, NY                 $203Bn
Microsoft                 Redmond, WA                $206Bn
DB-hard Queries

SELECT Market_Cap
FROM Companies
WHERE Company_Name = "Apple"

Number of Rows: 0

Problem: Closed World Assumption

Company_Name              Address                    Market Cap
Google                    Googleplex, Mtn. View CA   $170Bn
Intl. Business Machines   Armonk, NY                 $203Bn
Microsoft                 Redmond, WA                $206Bn
DB-hard Queries

SELECT Top_1 Image
FROM Pictures
WHERE Topic = "Business Success"
ORDER BY Relevance

Number of Rows: 0

Problem: Subjective Comparison
CrowdDB

Use the crowd to answer DB-hard queries.

Where to use the crowd:
• Find missing data
• Make subjective comparisons
• Recognize patterns

But not:
• Anything the computer already does well

M. Franklin et al., "CrowdDB: Answering Queries with Crowdsourcing," SIGMOD 2011
CrowdSQL

DDL Extensions: crowdsourced columns and crowdsourced tables

CREATE TABLE company (
  name STRING PRIMARY KEY,
  hq_address CROWD STRING);

CREATE CROWD TABLE department (
  university STRING,
  department STRING,
  phone_no STRING)
  PRIMARY KEY (university, department);

DML Extensions: CrowdEqual and CROWDORDER operators (currently UDFs)

CrowdEqual:
SELECT * FROM companies WHERE Name ~= "Big Blue";

CROWDORDER:
SELECT p FROM picture
WHERE subject = "Golden Gate Bridge"
ORDER BY CROWDORDER(p, "Which pic shows better %subject");
User Interface Generation

• A clear UI is key to response time and answer quality.
• The SQL schema can be leveraged to auto-generate the UI (cf. Oracle Forms); a toy sketch of the idea follows below.
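As a toy illustration of schema-driven UI generation (my sketch, not CrowdDB's generator): given a table name, its columns, and the values already known, emit a simple HTML form that asks only for the missing, crowdsourced fields.

def form_for(table, columns, known=None):
    """Build a minimal HTML form for one crowdsourced row from a schema."""
    known = known or {}
    rows = []
    for col in columns:
        if col in known:   # show known values read-only, ask only for the gaps
            rows.append(f"<p>{col}: {known[col]}</p>")
        else:
            rows.append(f'<p>{col}: <input name="{col}"/></p>')
    return (f"<form id='{table}'>\n" + "\n".join(rows) +
            "\n<input type='submit'/></form>")

print(form_for("company", ["name", "hq_address"], known={"name": "IBM"}))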
Subjective Comparisons

MTFunction
• Implements the CROWDEQUAL and CROWDORDER comparisons
• Takes a description and a type (equal, order) parameter
• Quality control is again based on majority vote (a toy aggregation sketch follows below)
• Ordering can be further optimized (e.g., three-way vs. two-way comparisons)
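To make the majority-vote step concrete, here is a toy aggregation (my sketch, not CrowdDB's operator) that turns redundant pairwise comparisons into a single ordering: keep each pair's majority winner, then rank items by how many pairs they win.

from collections import Counter

def aggregate_order(votes):
    """votes maps an item pair to the list of items chosen by each worker."""
    majority = {}
    for (a, b), ballots in votes.items():
        winner, _ = Counter(ballots).most_common(1)[0]   # majority vote per pair
        majority[(a, b)] = winner
    wins = Counter(majority.values())
    items = {x for pair in votes for x in pair}
    return sorted(items, key=lambda x: wins[x], reverse=True)

# Three workers compare each pair of pictures; p2 wins most pairs.
votes = {
    ("p1", "p2"): ["p2", "p2", "p1"],
    ("p1", "p3"): ["p1", "p1", "p3"],
    ("p2", "p3"): ["p2", "p2", "p2"],
}
print(aggregate_order(votes))   # ['p2', 'p1', 'p3']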
Does it Work? Picture Ordering

Query:
SELECT p FROM picture
WHERE subject = "Golden Gate Bridge"
ORDER BY CROWDORDER(p, "Which pic shows better %subject");

Data size: 30 subject areas, with 8 pictures each
Batching: 4 orderings per HIT
Replication: 3 assignments per HIT
Price: 1 cent per HIT

[Results compare turker votes and the aggregated turker ranking against an expert ranking]
User Interface vs. Quality

To get information about Professors and their Departments, three auto-generated interfaces were compared:
[Figure: Department-first form (≈20% error rate), Professor-first form (≈80% error rate), and a de-normalized probe (≈10% error rate)]
Can we build a "Crowd Optimizer"?

SELECT *
FROM Restaurant
WHERE city = …
Price vs. Response Time

[Plot: percentage of HITs with at least one assignment completed vs. time in minutes (0 to 60), for prices of $0.01, $0.02, $0.03, and $0.04 per HIT; 5 assignments, 100 HITs]
Turker Affinity and Errors

[Chart: errors plotted by turker rank]

[Franklin, Kossmann, Kraska, Ramesh, Xin, "CrowdDB: Answering Queries with Crowdsourcing," SIGMOD 2011]
Can we build a "Crowd Optimizer"?

SELECT *
FROM Restaurant
WHERE city = …

Worker forum comments about a requester:
"I would do work for this requester again."
"This guy should be shunned."
"I advise not clicking on his 'information about restaurants' hits."
"Hmm... I smell lab rat material."
"be very wary of doing any work for this requester…"
Processor Relations?

Tim Klas Kraska

HIT Group » "I recently did 299 HITs for this requester. … Of the 299 HITs I completed, 11 of them were rejected without any reason being given. Prior to this I only had 14 rejections, a .2% rejection rate. I currently have 8522 submitted HITs, with a .3% rejection rate after the rejections from this requester (25 total rejections). I have attempted to contact the requester and will update if I receive a response. Until then be very wary of doing any work for this requester, as it appears that they are rejecting about 1 in every 27 HITs being submitted." posted by …

fair: 2/5   fast: 4/5   pay: 2/5   comm: 0/5
Open World = No Semantics?

SELECT * FROM Crowd_Sourced_Table

• What does the above query return?
• In the old world, it was just a table scan.
• In the crowdsourced world:
  • Which answers are "right"?
  • When do we stop?
• BioStatistics to the Rescue????
Open World = No Semantics?

SELECT * FROM Crowd_Sourced_Table

Species Acquisition Curve for Data: borrow species-estimation techniques from biostatistics to judge how complete the crowdsourced answer set is (a toy estimator sketch follows below).
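A toy version of the species-estimation idea alluded to here, using the classic Chao1 estimator to guess how many distinct values the crowd has yet to report (my sketch, not the AMP Lab estimator):

from collections import Counter

def chao1(answers):
    """Estimate the total number of distinct values behind a crowdsourced
    enumeration from how often each observed value was reported."""
    freq = Counter(answers)
    observed = len(freq)
    f1 = sum(1 for c in freq.values() if c == 1)   # values seen exactly once
    f2 = sum(1 for c in freq.values() if c == 2)   # values seen exactly twice
    if f2 > 0:
        return observed + (f1 * f1) / (2 * f2)
    return observed + f1 * (f1 - 1) / 2            # bias-corrected form when f2 = 0

# Workers enumerate ice cream flavors; many singletons suggest many unseen flavors.
answers = ["vanilla", "chocolate", "vanilla", "mint", "pistachio",
           "chocolate", "strawberry", "mocha"]
print(chao1(answers))   # 6 observed, estimated ~10 distinct flavors in total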
Why the Crowd is Different
• Classical approaches don't quite work
• Incorrect answers
  – Chicago is not a state
  – How do you spell Mississippi?
• Streakers vs. samplers
  – Individuals sample without replacement
  – and there is worker/task affinity
• List walking
  – e.g., Googling "ice cream flavors"
• The above can be detected and mitigated to some extent.
How Can You Trust the Crowd?

• General techniques
  – Approval rate / demographic restrictions
  – Qualification tests
  – Gold sets / honey pots (a toy gold-set check is sketched below)
  – Redundancy
  – Verification / review
  – Justification / automatic verification
• Query-specific techniques
• Worker relationship management
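As a small illustration of the gold-set technique, here is a toy filter (function name, data shapes, and threshold are mine) that scores each worker on tasks with known answers and keeps only sufficiently accurate workers:

def filter_by_gold(worker_answers, gold, min_accuracy=0.8):
    """Keep workers whose accuracy on gold-set tasks meets the threshold."""
    trusted = {}
    for worker, answers in worker_answers.items():
        graded = [(t, a) for t, a in answers.items() if t in gold]
        if not graded:
            continue                                 # worker saw no gold tasks
        accuracy = sum(a == gold[t] for t, a in graded) / len(graded)
        if accuracy >= min_accuracy:
            trusted[worker] = answers
    return trusted

gold = {"task1": "IBM", "task2": "Armonk, NY"}
workers = {
    "w1": {"task1": "IBM", "task2": "Armonk, NY", "task9": "Redmond, WA"},
    "w2": {"task1": "Big Blue", "task2": "New York", "task9": "Seattle"},
}
print(filter_by_gold(workers, gold))   # keeps w1, drops w2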
Making Sense at Scale
• Data size is only part of the challenge
• Balance quality, cost, and time for a given problem
• To address this, we must holistically integrate Algorithms, Machines, and People
amplab.cs.berkeley.edu