Using Markov Chain Monte Carlo To Play Trivia. Daniel Deutch. ICDE ‘11 [Deutch, Greenshpan, Kostenko, Milo], PODS ‘10 [Deutch, Koch, Milo]. WebDAM meeting, March 2011

Page 1: Using Markov Chain Monte Carlo To Play Trivia

Using Markov Chain Monte Carlo To Play Trivia

Daniel Deutch

ICDE ‘11 [Deutch, Greenshpan, Kostenko, Milo]

PODS ‘10 [Deutch, Koch, Milo]

WebDAM meeting

March 2011

Page 2: Using Markov Chain Monte Carlo To Play Trivia

• Answering trivia questions automatically

• Based on knowledge collected from a crowd of users (human Trivia Players)

• Some of this knowledge is erroneous

– But we don’t know which

• Some of the users are more credible than others

– But we don’t know in advance the credibility of all users

Our goal

2

Page 3: Using Markov Chain Monte Carlo To Play Trivia

System Architecture

3

Page 4: Using Markov Chain Monte Carlo To Play Trivia

Trivia Game

4

Page 5: Using Markov Chain Monte Carlo To Play Trivia

Question Answering

5

Page 6: Using Markov Chain Monte Carlo To Play Trivia

Say that we collect information on Basketball players and their teams

[Support is the number of people who claimed the fact to be true]

Now we want to answer questions

• Simple ones: “Where does Gasol play?”

• Or more complex, involving additional information:

“Which team has the greatest number of players taller than 7ft?”

Motivating Example

Player    Team           Support
Gasol     Memphis        003
Gasol     Lakers         100
Duncan    San Antonio    50
Duncan    New Jersey     50

6

Page 7: Using Markov Chain Monte Carlo To Play Trivia

• By majority vote?

• But some people know NBA basketball better than others...

• So maybe a “biased vote”?

• But how to bias?

• A “chicken or the egg” problem: To know what is true we need to know whom to believe. But to know whom to believe we need to know who is usually right (and in particular, what is true).

How would we answer?

7

Page 8: Using Markov Chain Monte Carlo To Play Trivia

• Start with some estimation on the trust in users

• Gain confidence in facts based on the opinion of users that supported them

– Give more weight to users that we trust

• Then update the trust level in users, based on how many of the facts they submitted we believe

• Iterate until convergence

Trusted players give us confidence in facts, and players that supported these facts gain our trust…

So what can we do?

8

Page 9: Using Markov Chain Monte Carlo To Play Trivia

Start with some prior belief in every user

do {
  Probabilistically choose a correct team for every player, based on popular support (biased by the belief in users)
  Evaluate the query, add a point to every obtained query result
  Increase your belief in every user that supported the chosen assertion
} until (convergence of the ratio of results)

(One possible) Probabilistic Algorithm
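The loop above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the names `USER_FACTS`, `PRIOR`, `choose_teams` and `run` are hypothetical, the data is taken from the example use case, each supporter of a chosen fact gains one point, and a fixed step budget stands in for the convergence test.

```python
import random
from collections import Counter

# Example data from the use case: (player, team, user) assertions.
USER_FACTS = [
    ("Gasol", "Lakers", "Alice"),
    ("Gasol", "Lakers", "Bob"),
    ("Gasol", "Memphis", "Carol"),
    ("Duncan", "New Jersey", "Carol"),
    ("Duncan", "San Antonio", "Bob"),
]
PRIOR = {"Alice": 6, "Bob": 2, "Carol": 2}


def choose_teams(facts, belief, rng):
    """Pick one team per player, weighted by the summed belief of its supporters."""
    support = {}
    for player, team, user in facts:
        teams = support.setdefault(player, {})
        teams[team] = teams.get(team, 0) + belief[user]
    chosen = {}
    for player, teams in support.items():
        names = list(teams)
        weights = [teams[name] for name in names]
        chosen[player] = rng.choices(names, weights=weights, k=1)[0]
    return chosen


def run(facts, prior, query, steps, rng):
    """One walk: sample a world, tally the query result, reward supporters."""
    belief = dict(prior)
    tally = Counter()
    for _ in range(steps):  # assumption: fixed budget instead of a convergence test
        world = choose_teams(facts, belief, rng)
        tally[query(world)] += 1
        for player, team, user in facts:
            if world[player] == team:
                belief[user] += 1  # assumption: one point per supported chosen fact
    return tally, belief


# "Where does Duncan play?" evaluated on every sampled world.
tally, belief = run(USER_FACTS, PRIOR, lambda world: world["Duncan"], 200, random.Random(0))
print(tally, belief)
```

The query is passed in as a function of the sampled world, so a more complex question only changes that one argument.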

9

Page 10: Using Markov Chain Monte Carlo To Play Trivia

Example Use Case

Player    Team           User
Gasol     Lakers         Alice
Gasol     Lakers         Bob
Gasol     Memphis        Carol
Duncan    New Jersey     Carol
Duncan    San Antonio    Bob

User     Confidence
Alice    6
Bob      2
Carol    2

10

Page 11: Using Markov Chain Monte Carlo To Play Trivia

Partial Execution

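Initial user confidence (UC): A = 6, B = 2, C = 2

Possible choices of player teams (PT), with their probabilities and the resulting UC:

PT                Probability    Updated UC
G: LA, D: SA      8/10 * 2/4     A = 7, B = 4, C = 2
G: LA, D: NJ      8/10 * 2/4     A = 7, B = 3, C = 3
G: M,  D: NJ      2/10 * 2/4     …
G: M,  D: SA      2/10 * 2/4     …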
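The first-step probabilities can be checked by enumerating the four possible choices exactly. The sketch below assumes the rule used in the example (a team's probability is the summed confidence of its supporters over the total for that player, and each supporter of a chosen fact gains one point); the names `belief`, `supporters`, `team_probs` and `gap` are illustrative only.

```python
from fractions import Fraction
from itertools import product

belief = {"Alice": 6, "Bob": 2, "Carol": 2}
# Supporters of each candidate team, per player.
supporters = {
    "Gasol": {"LA": ["Alice", "Bob"], "M": ["Carol"]},
    "Duncan": {"SA": ["Bob"], "NJ": ["Carol"]},
}


def team_probs(options):
    """Probability of each team: summed supporter belief over the player's total."""
    weight = {team: sum(belief[u] for u in users) for team, users in options.items()}
    total = sum(weight.values())
    return {team: Fraction(w, total) for team, w in weight.items()}


gasol = team_probs(supporters["Gasol"])
duncan = team_probs(supporters["Duncan"])

gap = {}  # (Bob's gain minus Carol's gain) -> probability
for (g, pg), (d, pd) in product(gasol.items(), duncan.items()):
    prob = pg * pd
    bob = ("Bob" in supporters["Gasol"][g]) + ("Bob" in supporters["Duncan"][d])
    carol = ("Carol" in supporters["Gasol"][g]) + ("Carol" in supporters["Duncan"][d])
    gap[bob - carol] = gap.get(bob - carol, 0) + prob
    print(f"Gasol={g}, Duncan={d}: probability {prob}")

print(gap)
```

Exact fractions are used instead of floats so that the products match the hand computation digit for digit.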

11

Page 12: Using Markov Chain Monte Carlo To Play Trivia

• We believe Alice, but we don’t know whether to believe Bob or Carol. Hence we don’t know if Duncan plays for NJ or SA [Note that (weighted) majority vote won’t help us in deciding this]

• We have some reason to believe Bob more: we think he was correct in saying that Gasol plays for LA [We think this because we believe Alice]

[Therefore Bob may have better basketball knowledge]

• In general things are more complex (more users, facts, ...)

What’s going on?
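To see why a one-shot weighted vote cannot settle the Duncan question, here is a small sketch using the confidence values of the example. The function name `weighted_vote` and the data layout are assumptions made for illustration.

```python
from collections import defaultdict

facts = [
    ("Gasol", "Lakers", "Alice"),
    ("Gasol", "Lakers", "Bob"),
    ("Gasol", "Memphis", "Carol"),
    ("Duncan", "New Jersey", "Carol"),
    ("Duncan", "San Antonio", "Bob"),
]
confidence = {"Alice": 6, "Bob": 2, "Carol": 2}


def weighted_vote(facts, confidence):
    """Return, per player, every team tied for the highest summed confidence."""
    score = defaultdict(lambda: defaultdict(int))
    for player, team, user in facts:
        score[player][team] += confidence[user]
    winners = {}
    for player, teams in score.items():
        best = max(teams.values())
        winners[player] = sorted(t for t, s in teams.items() if s == best)
    return winners


print(weighted_vote(facts, confidence))
# Gasol has a single winner (Lakers, 8 against 2); Duncan is a 2-2 tie.
```

The tie for Duncan only breaks once Bob's correct Gasol answer is allowed to raise his confidence, which is what the iterative scheme does.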

12

Page 13: Using Markov Chain Monte Carlo To Play Trivia

With high probability, Bob will eventually gain a significant advantage (over Carol) in his confidence level

– At the first step there’s

• 40% chance of him getting a two-point gain over Carol,

• 50% chance of them having the same confidence level,

• 10% chance for Carol to win a two-point gain over Bob

– If Bob (eventually) wins an advantage, his chances of winning an extra advantage become even higher in the following steps

– With good chance, we will eventually get “almost certain” that Alice and Bob are both more credible than Carol

• And will conclude that Duncan Plays for San Antonio

What’s going on? (cont.)

13

Page 14: Using Markov Chain Monte Carlo To Play Trivia

• We can repeat the algorithm N times, each time restarting the confidence levels to their original ones

• In our example, almost 80% of the executions will result in believing that Duncan plays for SA

• Our automated Trivia query engine can answer “SA” to “Where does Duncan play?”, with good confidence

• This kind of algorithm is called Markov Chain Monte Carlo

Sampling
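The restart scheme can be sketched as below: each run starts from the original confidence levels, and the answer of a run is the team chosen most often for Duncan. This is an illustrative sketch under assumptions (fixed number of steps per run, one point per supported fact, hypothetical names `one_run` and `estimate`); the fractions it reports depend on those choices and on the random seed.

```python
import random
from collections import Counter

FACTS = [
    ("Gasol", "Lakers", "Alice"),
    ("Gasol", "Lakers", "Bob"),
    ("Gasol", "Memphis", "Carol"),
    ("Duncan", "New Jersey", "Carol"),
    ("Duncan", "San Antonio", "Bob"),
]
PRIOR = {"Alice": 6, "Bob": 2, "Carol": 2}


def one_run(steps, rng):
    """Single walk from the prior; returns the team most often chosen for Duncan."""
    belief = dict(PRIOR)  # restart from the original confidence levels
    tally = Counter()
    for _ in range(steps):
        world = {}
        for player in ("Gasol", "Duncan"):
            options = Counter()
            for p, team, user in FACTS:
                if p == player:
                    options[team] += belief[user]
            teams = list(options)
            world[player] = rng.choices(teams, weights=[options[t] for t in teams])[0]
        tally[world["Duncan"]] += 1
        for p, team, user in FACTS:
            if world[p] == team:
                belief[user] += 1
    return tally.most_common(1)[0][0]


def estimate(n_runs, steps, seed=0):
    """Fraction of independent runs that end up answering each team."""
    rng = random.Random(seed)
    answers = Counter(one_run(steps, rng) for _ in range(n_runs))
    return {team: count / n_runs for team, count in answers.items()}


print(estimate(n_runs=300, steps=100))
```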

14

Page 15: Using Markov Chain Monte Carlo To Play Trivia

• Markov Chain:

– Finite State Machine with probabilities on its transitions

– Markovian Property: the next state is independent of the past states (given the current state)

• In our case the states correspond to database instances

• The probabilities on the transitions are determined by the algorithm

Markov Chain

15

Page 16: Using Markov Chain Monte Carlo To Play Trivia

• A Monte Carlo algorithm

– A general term for an algorithm that relies on sampling

[We sample here over multiple traversals on the Markov Chain]

• Monte Carlo algorithms are by nature approximation algorithms

• But they approximate pretty well in practice

Monte Carlo

16

Page 17: Using Markov Chain Monte Carlo To Play Trivia

• The algorithm we have presented is useful by itself

• We could implement it in Java and use it

– But this may be complex for large datasets…

• Also, there are many other algorithms in the literature

• Conclusion:

– We want to have easy control over the employed policy

– We really don’t want to rewrite Java code for each tiny change!

– We want (seamless) optimization, update propagation, ...

General Policies

17

Page 18: Using Markov Chain Monte Carlo To Play Trivia

• Database approach: define a declarative language for specifying policies

• We defined a new language [PODS ‘10] which combines datalog (for recursion) with a way of making probabilistic choices

• Using this language we can define MCMC algorithms!

• Declarative specifications are robust, generic, and allow optimizations.

General Policies (cont.)

18

Page 19: Using Markov Chain Monte Carlo To Play Trivia

• Add to datalog a REPAIR-KEY (RK) [Koch ‘09] construct

• REPAIR-KEY “repairs” key violations (=contradictions) in the database, choosing one possible option, probabilistically, according to the sum of support in this option, relative to the other options

General Policies (cont.)
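The effect of REPAIR-KEY on one relation can be imitated in a few lines of Python. This is only a sketch of the behaviour described above, not the construct's actual implementation: `repair_key`, its parameters and the dictionary-based rows are assumptions made for illustration.

```python
import random
from collections import defaultdict


def repair_key(rows, key, value, weight, rng):
    """Keep one value per key, chosen with probability proportional to its summed weight."""
    totals = defaultdict(lambda: defaultdict(float))
    for row in rows:
        totals[row[key]][row[value]] += row[weight]
    repaired = {}
    for k, options in totals.items():
        values = list(options)
        repaired[k] = rng.choices(values, weights=[options[v] for v in values])[0]
    return repaired


# UserFacts joined with the users' confidence (values from the running example).
joined = [
    {"player": "Gasol", "team": "Lakers", "user": "Alice", "confidence": 6},
    {"player": "Gasol", "team": "Lakers", "user": "Bob", "confidence": 2},
    {"player": "Gasol", "team": "Memphis", "user": "Carol", "confidence": 2},
    {"player": "Duncan", "team": "New Jersey", "user": "Carol", "confidence": 2},
    {"player": "Duncan", "team": "San Antonio", "user": "Bob", "confidence": 2},
]

believed = repair_key(joined, "player", "team", "confidence", random.Random(0))
print(believed)  # one team per player; the key violation on "player" is gone
```

With these weights, Lakers is drawn for Gasol with probability 8/10 and each of Duncan's teams with probability 2/4.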

19

Page 20: Using Markov Chain Monte Carlo To Play Trivia

Example Use Case

UserFacts

Player    Team           User
Gasol     Lakers         Alice
Gasol     Lakers         Bob
Gasol     Memphis        Carol
Duncan    New Jersey     Carol
Duncan    San Antonio    Bob

UserConf

User     Conf.
Alice    3
Bob      2
Bob      2
Carol    2
Carol    2

20

Page 21: Using Markov Chain Monte Carlo To Play Trivia

DROP BelievedFacts;

INSERT INTO BelievedFacts
  REPAIR-KEY[Player @ Confidence] ON
  UserFacts INNER JOIN UsersConf;

Implementation in probabilistic datalog

21

Page 22: Using Markov Chain Monte Carlo To Play Trivia

UPDATE UsersConf, Q1
SET UsersConf.confidence = Q1.CorrectFacts + UsersConf.confidence
WHERE UsersConf.user = Q1.user;

Q1 = SELECT user, COUNT(DISTINCT player) AS CorrectFacts
     FROM Q2
     GROUP BY user;

Q2 = SELECT user, team, player
     FROM UserFacts UF
     WHERE EXISTS
       (SELECT *
        FROM BelievedFacts BF
        WHERE BF.player = UF.player AND
              BF.team = UF.team);

Implementation in probabilistic datalog
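The confidence update can be tried out on a standard SQL engine. The sketch below uses Python's built-in `sqlite3` module and folds Q1 and Q2 into one correlated subquery. It is an approximation under assumptions: the column is named `username` here to avoid any clash with reserved words, and BelievedFacts is filled by hand with one possible repair instead of being drawn by REPAIR-KEY, which plain SQL does not provide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE UserFacts (player TEXT, team TEXT, username TEXT);
    CREATE TABLE UsersConf (username TEXT PRIMARY KEY, confidence INTEGER);
    CREATE TABLE BelievedFacts (player TEXT PRIMARY KEY, team TEXT);

    INSERT INTO UserFacts VALUES
        ('Gasol', 'Lakers', 'Alice'),
        ('Gasol', 'Lakers', 'Bob'),
        ('Gasol', 'Memphis', 'Carol'),
        ('Duncan', 'New Jersey', 'Carol'),
        ('Duncan', 'San Antonio', 'Bob');
    INSERT INTO UsersConf VALUES ('Alice', 6), ('Bob', 2), ('Carol', 2);

    -- One sampled repair, fixed by hand here instead of drawn by REPAIR-KEY.
    INSERT INTO BelievedFacts VALUES ('Gasol', 'Lakers'), ('Duncan', 'San Antonio');
""")

# Add to each user's confidence the number of distinct players
# for which the user's asserted team matches the believed team.
conn.execute("""
    UPDATE UsersConf
    SET confidence = confidence + (
        SELECT COUNT(DISTINCT UF.player)
        FROM UserFacts UF
        JOIN BelievedFacts BF
          ON BF.player = UF.player AND BF.team = UF.team
        WHERE UF.username = UsersConf.username
    )
""")

result = dict(conn.execute("SELECT username, confidence FROM UsersConf"))
print(result)  # {'Alice': 7, 'Bob': 4, 'Carol': 2} (key order may differ)
```

The result matches the first branch of the partial execution: Alice gains one point, Bob gains two, and Carol gains none.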

22

Page 23: Using Markov Chain Monte Carlo To Play Trivia

• The prob. of a query result is the probability that it appears in the evaluation on the database, at an arbitrary point of the walk

• More formally:

– The probability of observing a sequence seq = [s1,…,sn] is the multiplication of transition probabilities

– Let |seq| denote the length of seq

– Given a database state s, Pr(s)=

– The prob. of a tuple t is the sum of Pr(s) over all s for which t appears in Q(s)

Semantics
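One way to read Pr(s), offered here as an assumption consistent with the informal description rather than as the authors' exact definition, is the long-run expected fraction of steps that the walk spends in state s. The symbol occ(s, seq), the number of positions of seq that equal s, is introduced only for this sketch, and the walk is assumed to start from a fixed initial state and the limit to exist.

```latex
\Pr(seq) = \prod_{i=1}^{n-1} \Pr(s_i \to s_{i+1}), \qquad
\Pr(s) = \lim_{n \to \infty} \sum_{seq \,:\, |seq| = n} \Pr(seq) \cdot \frac{occ(s, seq)}{|seq|}, \qquad
\Pr(t) = \sum_{s \,:\, t \in Q(s)} \Pr(s)
```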


Page 24: Using Markov Chain Monte Carlo To Play Trivia

Problem: Given a “probabilistic datalog” program and an SQL query on the database (“where does Duncan play”), compute the probabilities of the different answers to appear at a random time of the walk.

• Theorem: Exact computation is NP-hard

• Theorem: If the MC is ergodic, then computable in EXPTIME in number of states

– Compute the stochastic matrix of transitions

– Compute its fixpoint

– For an ergodic Markov Chain, this corresponds to the correct probabilities

– Sum up probabilities of states where the query event holds

• Theorem: In general, 2-EXPTIME

– Apply the above to each connected component of the Markov Chain

– Factor by probability of being in each component

Some Complexity Results
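For the ergodic case, the four steps can be illustrated on a toy chain. The sketch below finds the fixpoint by power iteration, which is one of several ways to compute it and is chosen here for brevity; `stationary`, `query_probability` and the two-state matrix are hypothetical, and a real instance would have one state per database instance.

```python
def stationary(P, tol=1e-12, max_iter=100_000):
    """Fixpoint pi = pi * P of a row-stochastic matrix P, by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(pi, nxt)) < tol:
            return nxt
        pi = nxt
    return pi


def query_probability(P, holds):
    """Sum the stationary probabilities of the states where the query event holds."""
    pi = stationary(P)
    return sum(p for p, h in zip(pi, holds) if h)


# Toy ergodic chain with two states; the query event holds only in state 1.
P = [
    [0.9, 0.1],
    [0.5, 0.5],
]
print(stationary(P))                        # approximately [5/6, 1/6]
print(query_probability(P, [False, True]))  # approximately 1/6
```

The cost is dominated by the size of the matrix, which is why this exact route becomes expensive when the number of states is large.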

24

Page 25: Using Markov Chain Monte Carlo To Play Trivia

Approximations:

– Absolute approximation: approximates correct probability ±ε

– Relative approximation: approximates correct probability up to a factor in-between (1-ε), (1+ε). [Relative is harder to achieve]

• Theorem: No PTIME (randomized) approximation exists unless P=NP

– Reduction from 3-SAT

– Construct a MC that probabilistically chooses assignments

– Query event asks whether the given assignment is satisfying

– Such an assignment will be found with prob. > 0 iff one exists

– Consequently deciding if the probability is 0 or not is hard

Approximation Algorithms

25

Page 26: Using Markov Chain Monte Carlo To Play Trivia

• Theorem: For ergodic underlying MC, randomized absolute approximation possible in PTIME w.r.t. DB size and MC mixing time

• The PTIME algorithm is a sampling algorithm [In the spirit of the algorithm demonstrated earlier.]

Approximation Algorithms (cont.)
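As a rough guide to how many samples an absolute ±ε guarantee needs, the standard Hoeffding bound can be used. This is a general-purpose estimate and not necessarily the analysis behind the theorem: it assumes independent samples (for instance, separate walks that have each run long enough to mix), and `samples_needed` is a hypothetical helper.

```python
import math


def samples_needed(eps, delta):
    """Samples so that the empirical frequency is within eps of the true
    probability with probability at least 1 - delta (Hoeffding bound)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps * eps))


print(samples_needed(0.05, 0.05))  # 738
```

The count grows with 1/ε² but only logarithmically with 1/δ, so tightening the error is much more expensive than tightening the failure probability.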

26

Page 27: Using Markov Chain Monte Carlo To Play Trivia

• How (and when) can we evaluate things fast enough?

• How to store the vast amount of data?

– Distributed Databases? Map-reduce?

• How to motivate users to contribute useful facts?

– First need to find what data is needed/missing…

– Give users points, based on answers’ novelty & correctness?

– But we do not know which answers are correct!

– Well, we can use our estimation algorithm…

– Other possible techniques?

• The data keeps changing. How to handle updates?

Other important issues

27