When is A=B?
Donald Kossmann, Systems Group, ETH Zurich
http://systems.ethz.ch
Acknowledgments
Insanity: doing the same thing over and over again and expecting different results. (A. Einstein)
Reality: We all are insane!
• When do you start believing that your paper is not worth publishing?
Speculations on IT Trends
• Big Data: Automating Experience
– Logic -> Statistics
– Open World Semantics
• Hybrid Human & Machine Systems: get the best of humans & machines
– to err is human
• Systems
– DNA, Quantum hardware: trade energy consumption for precision
– Distributed systems: design for failure
– Intel’s SCC: non-cache-coherent processors
Computers are becoming insane!
Implications
• We need to model insanity
– (too crazy for this talk)
– (will use Mechanical Turk to simulate craziness)
• We need to revisit algos & complexity theory
– focus of this talk
Traditional Complexity Theory
• Cost is a function of input
• Example: sorting in O(N * log N)
[Figure: Algo/Problem maps input to cost]
“Modern” Complexity Theory
• Cost is a function of input, quality, error rate
• Example: sorting is O(???)
[Figure: Algo/Problem maps input, quality, error to cost]
Alternative Complexity Theory
• Quality is a function of input, budget, error rate
• Example: sorting is O(???)
[Figure: Algo/Problem maps input, budget, error to quality]
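To make "quality is a function of input, budget, error rate" concrete, here is a minimal Python sketch. It is not from the talk: the pairwise quality metric, the choice of bubble sort, and the names `noisy_less`, `bubble_sort_with_budget`, `quality`, and `measure` are all assumptions made for illustration.

```python
import random


def noisy_less(a, b, error_rate, rng):
    """Compare a < b, but give the wrong answer with probability error_rate."""
    truth = a < b
    return (not truth) if rng.random() < error_rate else truth


def bubble_sort_with_budget(items, budget, error_rate, rng):
    """Bubble sort that stops as soon as `budget` comparisons are spent."""
    data = list(items)
    spent = 0
    while spent < budget:
        swapped = False
        for i in range(len(data) - 1):
            if spent >= budget:
                break
            spent += 1
            if noisy_less(data[i + 1], data[i], error_rate, rng):
                data[i], data[i + 1] = data[i + 1], data[i]
                swapped = True
        if not swapped:
            break  # a full pass without swaps: the sort believes it is done
    return data


def quality(result):
    """Assumed metric: fraction of pairs in the correct relative order."""
    n = len(result)
    pairs = n * (n - 1) // 2
    good = sum(1 for i in range(n) for j in range(i + 1, n)
               if result[i] < result[j])
    return good / pairs if pairs else 1.0


def measure(n, budget, error_rate, seed=0):
    """Quality reached on a shuffled input of size n (hypothetical harness)."""
    rng = random.Random(seed)
    items = list(range(n))
    rng.shuffle(items)
    return quality(bubble_sort_with_budget(items, budget, error_rate, rng))


if __name__ == "__main__":
    for budget in (10, 30, 100):
        print(budget, measure(n=10, budget=budget, error_rate=0.2))
```

With `error_rate=0.0` and a budget of at least n*(n-1) comparisons the sketch always reaches quality 1.0; the interesting region is a small budget combined with a nonzero error rate.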
Agenda
• Case Study: Entity Resolution, Joins
– when is A=B?
• Case Study: Sorting
– when is A<B?
Problem Statement
• You are the director of the Louvre
– you have gazillions of unknown paintings
– you have a bunch of students that guess: p(A) = p(B)?
• You would like to group the paintings by painter
– minimize cost (work of students)
– minimize errors (#paintings in wrong room)
• Assumption: There is a ground truth!
– (Many problems have no ground truth; e.g., grouping the best paintings.)
Naïve Algorithm
• Step 1: select two random paintings
• Step 2: ask students to compare them
• Step 3: goto Step 1 until done
• How can we do better???
Votes Graph
[Figure: votes graph over four paintings A, B, C, D, built up step by step; the final graph carries edge weights 50, 30, -100, -1]
• Is A = B? YES!
• Is B = C? YES!
• Is A = D? NO!
• Is B = C? ???
• Is B = C? YES!
Decision Functions
• Input: votes graph (with weights), two nodes
• Output: Yes, No, Do-not-know
• Desired Properties:
– Consistency: do not invent anything
– Convergence: do not always punt
– Reflexivity, Symmetry, Transitivity, Anti-transitivity
Min-Max Function
• Compute pScore, nScore
– take all positive, negative paths
– score of path: minimum of weights of edges (AND)
– pScore = maximum of score of all positive paths (OR)
– nScore = maximum of score of all negative paths (OR)
• Make decision based on quorum (e.g., q=3)
– Yes: pScore - nScore > q
– No: nScore - pScore > q
– Do-not-know: otherwise
Min/Max with Conflicts
[Figure: votes graph over A, B, C, D with edge weights 50, 30, -100, -1]
• Is B = C? YES
– pScore = 30
– nScore = 1
• Is A = D? NO
– pScore = 0
– nScore = 30
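The Min-Max function and the conflict example can be sketched in Python as follows. Two things here are assumptions, not statements from the slides: a "positive path" is taken to be a path with no negative edge and a "negative path" a path with exactly one negative edge, and the edge layout (A-B = 50, A-C = 30, B-C = -1, C-D = -100) is a guess chosen because it reproduces the scores listed above. The names `build_graph`, `path_scores`, and `min_max` are hypothetical.

```python
def build_graph(edges):
    """Undirected votes graph as a dict of dicts: graph[a][b] = weight."""
    graph = {}
    for a, b, weight in edges:
        graph.setdefault(a, {})[b] = weight
        graph.setdefault(b, {})[a] = weight
    return graph


def path_scores(graph, start, goal):
    """Return (pScore, nScore) over all simple paths from start to goal."""
    best = {0: 0, 1: 0}  # number of negative edges on the path -> best score

    def walk(node, visited, weakest, negatives):
        if node == goal:
            best[negatives] = max(best[negatives], weakest)  # OR over paths
            return
        for nxt, weight in graph.get(node, {}).items():
            if nxt in visited or weight == 0:
                continue
            neg = negatives + (weight < 0)
            if neg > 1:
                continue  # assumed: two "different" votes imply nothing
            # AND along a path: the weakest edge bounds the path score
            walk(nxt, visited | {nxt}, min(weakest, abs(weight)), neg)

    walk(start, {start}, float("inf"), 0)
    return best[0], best[1]


def min_max(graph, a, b, q=3):
    """Decide 'a = b?' with quorum q, following the rules on the slide."""
    p_score, n_score = path_scores(graph, a, b)
    if p_score - n_score > q:
        return "Yes"
    if n_score - p_score > q:
        return "No"
    return "Do-not-know"


# Assumed edge layout for the four-painting example.
g = build_graph([("A", "B", 50), ("A", "C", 30), ("B", "C", -1), ("C", "D", -100)])

if __name__ == "__main__":
    print(path_scores(g, "B", "C"), min_max(g, "B", "C"))  # (30, 1) Yes
    print(path_scores(g, "A", "D"), min_max(g, "A", "D"))  # (0, 30) No
```

Enumerating all simple paths is exponential in the worst case, so this is only suitable for toy graphs.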
Naïve Algorithm V2.0
• Step 1: select two random paintings, p1, p2
• Step 2: if (MinMax(p1, p2) == Do-not-know) ask students to compare them, else return MinMax(p1, p2)
• Step 3: goto Step 1 until done
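The three steps can be sketched as a small simulation. Everything named here is hypothetical: `painter` is a ground-truth map used only by the simulated students, "done" is assumed to mean that every pair has a decision, and `direct_edge_decision` is a deliberately simple stand-in for MinMax that looks only at the direct edge between two paintings.

```python
import itertools
import random


def ask_students(p1, p2, painter, error_rate, rng):
    """Simulated vote: +1 for 'same painter', -1 for 'different painter'."""
    truth = painter[p1] == painter[p2]
    answer = (not truth) if rng.random() < error_rate else truth
    return 1 if answer else -1


def direct_edge_decision(votes, p1, p2, q):
    """Stand-in decision function (not MinMax): uses the direct edge only."""
    weight = votes.get(frozenset((p1, p2)), 0)
    if weight > q:
        return "Yes"
    if weight < -q:
        return "No"
    return "Do-not-know"


def naive_v2(painter, decide, q=3, error_rate=0.0, max_rounds=10_000, seed=0):
    """Run the loop; return (decisions per pair, number of questions asked)."""
    rng = random.Random(seed)
    paintings = sorted(painter)
    pairs = list(itertools.combinations(paintings, 2))
    votes = {}  # frozenset({p1, p2}) -> summed votes
    questions = 0
    for _ in range(max_rounds):
        # Step 3 (assumed stopping rule): done when every pair is decided.
        if all(decide(votes, a, b, q) != "Do-not-know" for a, b in pairs):
            break
        p1, p2 = rng.sample(paintings, 2)              # Step 1
        if decide(votes, p1, p2, q) == "Do-not-know":  # Step 2
            key = frozenset((p1, p2))
            votes[key] = votes.get(key, 0) + ask_students(
                p1, p2, painter, error_rate, rng)
            questions += 1
    decisions = {(a, b): decide(votes, a, b, q) for a, b in pairs}
    return decisions, questions


if __name__ == "__main__":
    truth = {"A": "Monet", "B": "Monet", "C": "Monet", "D": "Degas"}
    decisions, questions = naive_v2(truth, direct_edge_decision)
    print(questions, decisions)
```

With the stand-in, every pair has to collect its own votes. A path-based `decide` such as Min-Max can answer some pairs from votes on other pairs, which is where the savings over the first naïve algorithm would come from.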
Min/Max and Transitivity?
[Figure: votes graph over A, B, C, D, E with edge weights 5, 5, 5, -2, 3]
• A = D? YES
– pScore = 5
– nScore = 2
• D = E? YES
– pScore = 3
– nScore = 0
• A = E? Do-not-know
– pScore = 3
– nScore = 2
When is A=E?
[Figure: votes graph over A, B, C, D, E with edge weights 5, 5, 5, -2, 3]
• Crowdsource A=E: need at least 5 votes for success.
• Crowdsource D=E: in best case, only 2 more votes needed.
Many more surprises like that!!!
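The two vote counts can be checked by hand under assumptions that the slides do not state explicitly: the edges are taken to be A-B = 5, B-C = 5, C-D = 5, A-D = -2, D-E = 3 (a layout that reproduces the listed scores), and the quorum is taken to be q = 2 with a strict inequality, since that is the integer value for which A = D and D = E come out as YES while A = E stays undecided. Let v be the number of unanimous "yes" votes collected on a new direct edge A-E, and k the number of additional unanimous "yes" votes on the existing edge D-E.

```latex
% Option 1: ask about A=E directly (new edge A-E with weight v)
\begin{align*}
\mathrm{pScore}(A,E) &= \max\bigl(v,\ \min(5,5,5,3)\bigr) = \max(v,3), \\
\mathrm{nScore}(A,E) &= \min(|-2|,\,3) = 2, \\
\max(v,3) - 2 > 2 &\iff v \ge 5 .
\end{align*}

% Option 2: ask about D=E instead (edge D-E grows from 3 to 3+k)
\begin{align*}
\mathrm{pScore}(A,E) &= \min(5,5,5,\,3+k) = \min(5,\,3+k), \\
\mathrm{nScore}(A,E) &= \min(|-2|,\,3+k) = 2, \\
\min(5,\,3+k) - 2 > 2 &\iff k \ge 2 .
\end{align*}
```

Under these assumptions, strengthening the weakest edge on an existing path (2 votes) is cheaper than asking the question of interest directly (5 votes).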
Related Work & Alternatives
• R. Fagin, E. Wimmers: A formula for incorporating weights into scoring rules. 2000.
• M. Schulze: A new monotonic, clone-independent, reversal symmetric, and Condorcet-consistent single-winner election method. 2011.
• Huge body of work on ER in DB, II communities.
• Other decision function: MinCuts!
Summary
• Getting A=B right more important than algorithm
– Naïve algo with Min/Max >> Correlation Clustering
• Result of A=B depends on C, D, …
– sounds trivial, but has nasty implications
– need a decision function: new cost/precision tradeoffs
– Some trad. algos (e.g., CC) do not work
• Complexity: Still unknown!
– interesting future work
Agenda
• Case Study: Entity Resolution, Joins
– when is A=B?
• Case Study: Sorting
– when is A<B?
Revisit Sorting Algos
• How do traditional sorting algorithms behave
– Quicksort
– Bubblesort
• Look at new sorting algorithms based on graph
– PageRank
– Min/Max
– Schulze method
• Focus on Quicksort vs. Bubblesort here
– Just give a glimpse of what can happen
Quicksort: Effect of built-in transitivity
• Sort the following sequence: Neutral, Painful, Good, Excellent, Bad
• Use “Good” as pivot element for partitioning
– fumble the “Painful < Good” comparison
– result: Excellent, Painful, Good, Neutral, Bad
• One bad comparison propagates to three misclassifications
– quality of result can become arbitrarily bad
– difficult to extend QSort algo with safety net
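The fumbled partition can be reproduced with a short Python sketch. The rating scale (Painful < Bad < Neutral < Good < Excellent), the descending sort order, and the middle-element pivot rule are assumptions chosen so that "Good" becomes the first pivot; `make_comparator`, `quicksort_desc`, and `inversions` are hypothetical names.

```python
# Assumed ground-truth scale, worst to best.
RANK = {"Painful": 0, "Bad": 1, "Neutral": 2, "Good": 3, "Excellent": 4}


def make_comparator(fumbled_pairs):
    """Return better(a, b); answers for pairs in fumbled_pairs are inverted."""
    def better(a, b):
        truth = RANK[a] > RANK[b]
        return (not truth) if frozenset((a, b)) in fumbled_pairs else truth
    return better


def quicksort_desc(items, better):
    """Quicksort, best first, using the middle element as pivot."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    pivot = items[mid]
    higher, lower = [], []
    for x in items[:mid] + items[mid + 1:]:
        (higher if better(x, pivot) else lower).append(x)
    # Elements are never compared across partitions: this is the
    # built-in transitivity that lets one error spread.
    return quicksort_desc(higher, better) + [pivot] + quicksort_desc(lower, better)


def inversions(result):
    """Number of pairs that appear in the wrong relative order."""
    return sum(1 for i in range(len(result)) for j in range(i + 1, len(result))
               if RANK[result[i]] < RANK[result[j]])


sequence = ["Neutral", "Painful", "Good", "Excellent", "Bad"]
fumble = {frozenset(("Painful", "Good"))}
result = quicksort_desc(sequence, make_comparator(fumble))

if __name__ == "__main__":
    print(result)              # ['Excellent', 'Painful', 'Good', 'Neutral', 'Bad']
    print(inversions(result))  # 3
```

The single wrong answer puts "Painful" on the high side of the pivot, so it ends up ahead of Good, Neutral, and Bad without ever being compared against the latter two.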
Results (20% error, uniform)
[Figure: plot comparing QuickSort and BubbleSort; axes: Cost (number of iterations of algorithm) and Quality (%)]
Summary
• Some algos implicitly exploit transitivity
– difficult to control cost/quality tradeoff
– might result in a poor result for specific application
• QuickSort >> Bubblesort no longer true
– depends on error and quality expectation
– there are better and worse ways to exploit transitivity depending on budget and error behavior
– confirms observations of “A=B” study
Related Work on Sorting
• Ludwig Busse et al.: The information content in sorting algorithms. 2012.
• M. Schulze: A new monotonic, clone-independent, reversal symmetric, and Condorcet-consistent single-winner election method. 2011.
• Qurk (MIT) & Deco (Stanford) projects. 2011-2013.
• …
Conclusion & Future Work
• Computers are becoming insane
– because they automate more of the insane world
– because we are hitting the limits of trad. computing
– consequence: quality becomes a major metric
• Adding “quality” has dramatic implications
– need to revisit algorithms to become fault-tolerant
– need to revisit complexity: totally open
– need to revisit debugging and testing: totally open