When is A=B?

Donald Kossmann, Systems Group, ETH Zurich

http://systems.ethz.ch

Acknowledgments

Insanity: doing the same thing over and over again and expecting different results. (A. Einstein)

Reality: We all are insane!

• When do you start believing that your paper is not worth publishing?

Speculations on IT Trends

• Big Data: Automating Experience
– Logic -> Statistics
– Open World Semantics

• Hybrid Systems: Get the best of humans & machines
– to err is human

• Systems
– DNA, Quantum: trade energy for precision
– Distributed systems: design for failure
– Intel’s SCC: non-cache-coherent processors

Computers are becoming insane!

Implications

• We need to model insanity
– (too crazy for this talk)
– (will use Mechanical Turk to simulate craziness)

• We need to revisit algos & complexity theory
– focus of this talk

Traditional Complexity Theory

• Cost is a function of input

• Example: sorting in O(N * log N)

[Diagram: Algo/Problem, mapping input to cost]

“Modern” Complexity Theory

• Cost is a function of input, quality, error rate

• Example: sorting is O(???)

[Diagram: Algo/Problem, mapping input, quality, and error to cost]

Alternative Complexity Theory

• Quality is a function of input, budget, error rate

• Example: sorting is O(???)

[Diagram: Algo/Problem, mapping input, budget, and error to quality]

Agenda

• Case Study: Entity Resolution, Joins
– when is A=B?

• Case Study: Sorting
– when is A<B?

Problem Statement

• You are the director of the Louvre
– you have gazillions of unknown paintings
– you have a bunch of students that guess: p(A) = p(B)?

• You would like to group the paintings by painter
– minimize cost (work of students)
– minimize errors (#paintings in wrong room)

• Assumption: There is a ground truth!
– (Many problems have no ground truth; e.g., grouping the best paintings.)

Naïve Algorithm

• Step 1: select two random paintings

• Step 2: ask students to compare them

• Step 3: goto Step 1 until done

• How can we do better???
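The three steps above can be sketched as a small simulation. This is only an illustration under stated assumptions: the students are replaced by a hypothetical `ask_student` function that answers wrongly with a fixed probability, "until done" is modeled as a fixed vote budget, and all names (`naive_votes`, `painter`, `budget`, `error_rate`) are invented here rather than taken from the talk.

```python
import random
from collections import defaultdict

def ask_student(a, b, painter, error_rate, rng):
    """Hypothetical student: +1 for 'same painter', -1 for 'different'.
    The answer is flipped with probability error_rate."""
    truth = painter[a] == painter[b]
    answer = truth if rng.random() >= error_rate else not truth
    return 1 if answer else -1

def naive_votes(paintings, painter, budget, error_rate=0.2, seed=0):
    """Naive algorithm: pick two random paintings, ask, repeat.
    'Until done' is modeled as a fixed budget (an assumption)."""
    rng = random.Random(seed)
    votes = defaultdict(int)  # net votes per unordered pair
    for _ in range(budget):
        a, b = rng.sample(paintings, 2)                                     # Step 1
        votes[frozenset((a, b))] += ask_student(a, b, painter, error_rate, rng)  # Step 2
    return dict(votes)                                                      # Step 3: loop ends

# Ground truth for a toy museum (made-up data).
painter = {"p1": "Monet", "p2": "Monet", "p3": "Degas"}
clean = naive_votes(list(painter), painter, budget=50, error_rate=0.0)
noisy = naive_votes(list(painter), painter, budget=50, error_rate=0.2)
```

With `error_rate=0.0` every net vote has the sign of the ground truth; with errors, the net votes per pair form the weighted votes graph discussed next.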

Votes Graph

[Figure: votes graph over the nodes A, B, C, D, built up over several slides]

• Is A = B? YES!

• Is B = C? YES!

• Is A = D? NO!

• Is B = C? ???

• Is B = C? YES!

[Figure, final build: edge weights 50, 30, -100, -1]

Decision Functions

• Input: votes graph (with weights), two nodes

• Output: Yes, No, Do-not-know

• Desired Properties:
– Consistency: do not invent anything
– Convergence: do not always punt
– Reflexivity, Symmetry, Transitivity, Anti-transitivity

Min-Max Function

• Compute pScore, nScore
– take all positive, negative paths
– score of path: minimum of weights of edges (AND)
– pScore = maximum of score of all positive paths (OR)
– nScore = maximum of score of all negative paths (OR)

• Make decision based on quorum (e.g., q=3)
– Yes: pScore - nScore > q
– No: nScore - pScore > q
– Do-not-know: otherwise

Min/Max with Conflicts

[Figure: votes graph over the nodes A, B, C, D with edge weights 50, 30, -100, -1]

• Is B = C? YES
– pScore = 30
– nScore = 1

• Is A = D? NO
– pScore = 0
– nScore = 30
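A minimal sketch of the Min-Max decision function, to make the path scoring concrete. Several things here are assumptions, not taken from the slides: the edge placement (A–B = 50, A–C = 30, B–C = -1, C–D = -100) is one layout that reproduces the scores above, a "negative path" is taken to be a simple path with exactly one negative edge, and the names `path_scores` and `decide` are invented. Enumerating all simple paths is exponential, which is acceptable only for toy graphs.

```python
def path_scores(edges, src, dst):
    """Return (pScore, nScore) between src and dst.
    Path score = minimum absolute edge weight on the path (AND);
    pScore/nScore = maximum over positive/negative paths (OR).
    Assumption: a negative path has exactly one negative edge."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    best = {0: 0, 1: 0}  # best score, keyed by number of negative edges

    def dfs(node, visited, bottleneck, negatives):
        if node == dst:
            best[negatives] = max(best[negatives], bottleneck)
            return
        for nxt, w in adj.get(node, []):
            neg = negatives + (1 if w < 0 else 0)
            if nxt in visited or neg > 1 or w == 0:
                continue
            dfs(nxt, visited | {nxt}, min(bottleneck, abs(w)), neg)

    dfs(src, {src}, float("inf"), 0)
    return best[0], best[1]

def decide(edges, a, b, q=3):
    """Quorum rule as stated on the slide (strict '>')."""
    p, n = path_scores(edges, a, b)
    if p - n > q:
        return "Yes"
    if n - p > q:
        return "No"
    return "Do-not-know"

# Assumed edge placement for the four-node example.
EDGES = {("A", "B"): 50, ("A", "C"): 30, ("B", "C"): -1, ("C", "D"): -100}

print(path_scores(EDGES, "B", "C"), decide(EDGES, "B", "C"))
print(path_scores(EDGES, "A", "D"), decide(EDGES, "A", "D"))
```

Under this layout, B = C is carried by the positive path B–A–C (score 30) against the single direct negative vote, and A ≠ D follows from the path A–C–D, whose weakest edge is A–C.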

Naïve Algorithm V2.0

• Step 1: select two random paintings, p1, p2

• Step 2: if (MinMax(p1, p2) == Do-not-know) ask students to compare them, else return MinMax(p1, p2)

• Step 3: goto Step 1 until done

Min/Max and Transitivity?

[Figure: votes graph over the nodes A, B, C, D, E with edge weights 5, 5, -2, 5, 3]

A = D? YES
• pScore = 5
• nScore = 2

D = E? YES
• pScore = 3
• nScore = 0

A = E? Do-not-know
• pScore = 3
• nScore = 2

When is A=E?

[Figure: votes graph over the nodes A, B, C, D, E with edge weights 5, 5, -2, 5, 3]

• Crowdsource A=E: need at least 5 votes for success.

• Crowdsource D=E: in the best case, only 2 more votes are needed.

Many more surprises like that!!!
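The vote counts above can be reproduced with a small experiment. This is a sketch under assumptions: the edge placement (A–B = 5, B–D = 5, A–C = 5, C–D = -2, D–E = 3) is one layout consistent with the scores on the slides, a negative path is taken to have exactly one negative edge, a margin pScore - nScore of 3 is assumed to be enough for "Yes", and every additional vote is assumed to be positive (the best case). The helper names are invented.

```python
def path_scores(edges, src, dst):
    """(pScore, nScore): max over simple paths of the minimum absolute
    edge weight; negative paths have exactly one negative edge (assumption)."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    best = {0: 0, 1: 0}

    def dfs(node, visited, bottleneck, negatives):
        if node == dst:
            best[negatives] = max(best[negatives], bottleneck)
            return
        for nxt, w in adj.get(node, []):
            neg = negatives + (1 if w < 0 else 0)
            if nxt in visited or neg > 1 or w == 0:
                continue
            dfs(nxt, visited | {nxt}, min(bottleneck, abs(w)), neg)

    dfs(src, {src}, float("inf"), 0)
    return best[0], best[1]

def votes_needed(edges, edge, a, b, margin=3, limit=50):
    """Smallest number of extra positive votes on `edge` so that
    pScore - nScore for (a, b) reaches `margin`; None if not within limit.
    `edge` must use the same key orientation as in `edges`."""
    for extra in range(limit + 1):
        trial = dict(edges)
        trial[edge] = trial.get(edge, 0) + extra
        p, n = path_scores(trial, a, b)
        if p - n >= margin:
            return extra
    return None

# Assumed edge placement for the five-node example.
EDGES = {("A", "B"): 5, ("B", "D"): 5, ("A", "C"): 5,
         ("C", "D"): -2, ("D", "E"): 3}

direct = votes_needed(EDGES, ("A", "E"), "A", "E")    # ask about A=E itself
indirect = votes_needed(EDGES, ("D", "E"), "A", "E")  # ask about D=E instead
print(direct, indirect)
```

In this layout the path A–B–D–E is limited by its weakest edge D–E, so strengthening that edge lifts the pScore of A = E more cheaply than building a fresh direct edge.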

Related Work & Alternatives

• R. Fagin, E. Wimmers: A formula for incorporating weights into scoring rules. 2000.

• M. Schulze: A new monotonic, clone-independent, reversal symmetric, and Condorcet-consistent single-winner election method. 2011.

• Huge body of work on ER in DB, II communities.

• Other decision function: MinCuts!

Summary

• Getting A=B right is more important than the algorithm
– Naïve algo with Min/Max >> Correlation Clustering

• Result of A=B depends on C, D, …
– sounds trivial, but has nasty implications
– need a decision function: new cost/precision tradeoffs
– some trad. algos (e.g., Correlation Clustering) do not work

• Complexity: Still unknown!
– interesting future work

Agenda

• Case Study: Entity Resolution, Joins
– when is A=B?

• Case Study: Sorting
– when is A<B?

Revisit Sorting Algos

• How do traditional sorting algorithms behave?
– Quicksort
– Bubblesort

• Look at new sorting algorithms based on graph
– PageRank
– Min/Max
– Schulze method

• Focus on Quicksort vs. Bubblesort here
– Just give a glimpse of what can happen

Quicksort: Effect of built-in transitivity

• Sort the following sequence:
Neutral, Painful, Good, Excellent, Bad

• Use “Good” as pivot element for partitioning
– Fumble the “Painful < Good” comparison
– Result: Excellent, Painful, Good, Neutral, Bad

• One bad comparison propagates to three misclassifications
– quality of result can become arbitrarily bad
– difficult to extend QSort algo with safety net
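The example above can be replayed in a few lines. This is a sketch with assumptions: the true ranking is taken to be Excellent > Good > Neutral > Bad > Painful, the list is sorted best-first, the pivot is the middle element (which is "Good" for this input), and exactly one comparison, Painful vs. Good, returns the wrong answer. The names `better`, `quicksort_desc`, and `misordered_pairs` are invented for illustration.

```python
# Assumed ground-truth ranking (higher is better).
RANK = {"Excellent": 4, "Good": 3, "Neutral": 2, "Bad": 1, "Painful": 0}
# The single comparison that is answered wrongly.
FUMBLED = {frozenset(("Painful", "Good"))}

def better(a, b):
    """True if a is judged better than b; lies on fumbled pairs."""
    truth = RANK[a] > RANK[b]
    return (not truth) if frozenset((a, b)) in FUMBLED else truth

def quicksort_desc(items, cmp):
    """Quicksort, best element first, middle element as pivot."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    pivot = items[mid]
    above, below = [], []
    for x in items[:mid] + items[mid + 1:]:
        (above if cmp(x, pivot) else below).append(x)
    # Elements never cross the pivot again: transitivity is built in.
    return quicksort_desc(above, cmp) + [pivot] + quicksort_desc(below, cmp)

def misordered_pairs(result):
    """Pairs that appear in the wrong relative order in the result."""
    return [(a, b) for i, a in enumerate(result)
            for b in result[i + 1:] if RANK[a] < RANK[b]]

sequence = ["Neutral", "Painful", "Good", "Excellent", "Bad"]
result = quicksort_desc(sequence, better)
print(result)
print(misordered_pairs(result))
```

Because "Painful" lands in the partition above the pivot, it is never compared with "Neutral" or "Bad" again, so the single fumbled comparison leaves three pairs in the wrong order.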

Results (20% error, uniform)

[Figure: QuickSort vs. BubbleSort; axes: cost (number of iterations of algorithm) and quality (%)]
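One way such a cost/quality experiment could be set up is sketched below for the BubbleSort side only. Everything specific is an assumption: a comparison is wrong independently with probability `error_rate` (the "uniform" error model), one "iteration" is one bubble pass over the list, and quality is the percentage of pairs in the correct relative order. The sketch does not reproduce the numbers in the figure.

```python
import random

def noisy_less(a, b, error_rate, rng):
    """a < b, answered wrongly with probability error_rate."""
    truth = a < b
    return (not truth) if rng.random() < error_rate else truth

def bubble_pass(items, error_rate, rng):
    """One left-to-right pass of adjacent compare-and-swap."""
    for i in range(len(items) - 1):
        if noisy_less(items[i + 1], items[i], error_rate, rng):
            items[i], items[i + 1] = items[i + 1], items[i]

def quality(items):
    """Percentage of pairs (i < j) that are in ascending order."""
    n = len(items)
    good = sum(1 for i in range(n) for j in range(i + 1, n)
               if items[i] < items[j])
    return 100.0 * good / (n * (n - 1) // 2)

def bubble_quality_curve(n, passes, error_rate, seed=0):
    """Quality after each pass, starting from a random permutation."""
    rng = random.Random(seed)
    items = list(range(n))
    rng.shuffle(items)
    curve = []
    for _ in range(passes):
        bubble_pass(items, error_rate, rng)
        curve.append(quality(items))
    return curve

for q in bubble_quality_curve(n=20, passes=10, error_rate=0.2):
    print(round(q, 1))
```

Each extra pass spends more comparisons and can repair earlier mistakes locally, which is the kind of budget-for-quality knob that a partitioning algorithm such as QuickSort does not offer as directly.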

Summary

• Some algos implicitly exploit transitivity
– difficult to control cost/quality tradeoff
– might result in a poor result for specific application

• QuickSort >> Bubblesort no longer true
– depends on error and quality expectation
– there are better and worse ways to exploit transitivity, depending on budget and error behavior
– confirms observations of “A=B” study

Related Work on Sorting

• Ludwig Busse et al.: The information content in sorting algorithms. 2012.

• M. Schulze: A new monotonic, clone-independent, reversal symmetric, and Condorcet-consistent single-winner election method. 2011.

• Qurk (MIT) & Deco (Stanford) projects. 2011-2013.

• …

Conclusion & Future Work

• Computers are becoming insane
– because they automate more of the insane world
– because we are hitting the limits of trad. computing
– consequence: quality becomes a major metric

• Adding “quality” has dramatic implications
– need to revisit algorithms to become fault-tolerant
– need to revisit complexity: totally open
– need to revisit debugging and testing: totally open
