POKER AGENTS. LD Miller & Adam Eck. May 3, 2011.


  • Slide 1
  • POKER AGENTS. LD Miller & Adam Eck. May 3, 2011
  • Slide 2
  • Motivation: Classic environment properties of MAS: stochastic behavior (agents and environment), incomplete information, uncertainty. Application examples: robotics, intelligent user interfaces, decision support systems
  • Slide 3
  • Overview: Background, Methodology (updated), Results (updated), Conclusions (updated)
  • Slide 4
  • Background | Texas Hold'em Poker: Games consist of 4 steps: (1) pre-flop, (2) flop, (3) turn, (4) river. Actions: bet (check, raise, call) and fold. Bets can be limited or unlimited. [Figure: each player holds two private cards; shared community cards are dealt across the four steps]
  • Slide 5
  • Background | Texas Hold'em Poker: Significant worldwide popularity and revenue: the World Series of Poker (WSOP) attracted 63,706 players in 2010 (WSOP, 2010), and online sites generated an estimated $20 billion in 2007 (Economist, 2007). Poker has a fortuitous mix of strategy and luck: community cards allow for more accurate modeling, yet there remain many outs (remaining community cards) which defeat strong hands
  • Slide 6
  • Background | Texas Hold'em Poker: Strategy depends on hand strength, which changes from step to step! Hands that were strong early in the game may get weaker (and vice versa) as cards are dealt. [Figure: the same private cards shift from "raise!" to "check?" to "fold?" as community cards appear]
  • Slide 7
  • Background | Texas Hold'em Poker: Strategy also depends on betting behavior. Three types (Smith, 2009): aggressive players, who often bet/raise to force folds; optimistic players, who often call to stay in hands; conservative ("tight") players, who often fold unless they have really strong hands
  • Slide 8
  • Methodology | Strategies: Solution 2: probability distributions. Hand strength is measured using Poker Prophesier (http://www.javaflair.com/pp/). (1) Check hand strength against the behavior table to select a tactic; (2) roll on the tactic table for an action.

    | Tactic | Fold | Call | Raise |
    |--------|------|------|-------|
    | Weak | [0, 0.7) | [0.7, 0.95) | [0.95, 1) |
    | Medium | [0, 0.3) | [0.3, 0.7) | [0.7, 1) |
    | Strong | [0, 0.05) | [0.05, 0.3) | [0.3, 1) |

    | Behavior | Weak | Medium | Strong |
    |----------|------|--------|--------|
    | Aggressive | [0, 0.2) | [0.2, 0.6) | [0.6, 1) |
    | Optimistic | [0, 0.5) | [0.5, 0.9) | [0.9, 1) |
    | Conservative | [0, 0.3) | [0.3, 0.8) | [0.8, 1) |
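The two-step lookup on this slide ((1) hand strength selects a tactic, (2) a uniform roll on the tactic selects an action) can be sketched as follows; the interval values are transcribed from the slide's tables, while the function and table names are illustrative:

```python
# Behavior row: hand-strength intervals mapping to a tactic (from the slide).
BEHAVIOR = {
    "aggressive":   [(0.2, "weak"), (0.6, "medium"), (1.0, "strong")],
    "optimistic":   [(0.5, "weak"), (0.9, "medium"), (1.0, "strong")],
    "conservative": [(0.3, "weak"), (0.8, "medium"), (1.0, "strong")],
}
# Tactic row: a roll in [0, 1) mapping to an action (from the slide).
TACTIC = {
    "weak":   [(0.7, "fold"), (0.95, "call"), (1.0, "raise")],
    "medium": [(0.3, "fold"), (0.7, "call"), (1.0, "raise")],
    "strong": [(0.05, "fold"), (0.3, "call"), (1.0, "raise")],
}

def lookup(intervals, x):
    """Return the label of the first half-open interval containing x."""
    for upper, label in intervals:
        if x < upper:
            return label
    return intervals[-1][1]

def choose_action(behavior, hand_strength, roll):
    """(1) hand strength -> tactic; (2) uniform roll -> action."""
    tactic = lookup(BEHAVIOR[behavior], hand_strength)
    return lookup(TACTIC[tactic], roll)
```

In play, `roll` would be drawn with `random.random()`; it is passed in explicitly here so the mapping is easy to check.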
  • Slide 9
  • Methodology | Deceptive Agent: Problem 1: agents don't explicitly deceive. They reveal their strategy with every action and are easy to model. Solution: alternate strategies periodically (conservative to aggressive and vice versa) to break opponent modeling (concept shift)
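The periodic strategy swap can be sketched as below; the class name, the switch period, and the choice of exactly two base strategies are illustrative assumptions, not details given on the slide:

```python
class DeceptiveAgent:
    """Swap between two base strategies every `period` hands, inducing
    concept shift in an opponent's model of this agent (sketch)."""

    def __init__(self, period=50):
        self.period = period                      # hands between swaps (assumed)
        self.strategies = ["conservative", "aggressive"]
        self.hands_played = 0

    def current_strategy(self):
        # Alternates blocks: conservative for `period` hands, then aggressive.
        return self.strategies[(self.hands_played // self.period) % 2]

    def end_hand(self):
        self.hands_played += 1
```

An opponent model fitted to the conservative block mispredicts once the aggressive block begins, which is the intended modeling break.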
  • Slide 10
  • Methodology | Explore/Exploit: Problem 2: basic agents don't adapt. They ignore opponent behavior and use static strategies. Solution: use reinforcement learning (RL) to implicitly model opponents and revise action probabilities; explore the space of strategies, then exploit success
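As a hedged illustration of "explore the space of strategies, then exploit success", here is a minimal epsilon-greedy sketch that learns a value per strategy; the reward signal (chips won per hand), epsilon value, and class name are assumptions, not the deck's actual implementation:

```python
import random

class StrategyRL:
    """Epsilon-greedy selection over candidate strategies with
    incremental-average value updates (a minimal sketch)."""

    def __init__(self, strategies, epsilon=0.2):
        self.epsilon = epsilon
        self.value = {s: 0.0 for s in strategies}   # running average reward
        self.count = {s: 0 for s in strategies}

    def select(self, rng=random.random, choice=random.choice):
        if rng() < self.epsilon:                    # explore
            return choice(list(self.value))
        return max(self.value, key=self.value.get)  # exploit best-so-far

    def update(self, strategy, reward):
        # Incremental mean: v += (r - v) / n
        self.count[strategy] += 1
        n = self.count[strategy]
        self.value[strategy] += (reward - self.value[strategy]) / n
```

The `rng` and `choice` hooks are injected only so the selection is testable; in play the defaults suffice.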
  • Slide 11
  • Methodology | Active Sensing: Opponent model = knowledge, refined through observations (betting history, opponents' cards). Actions produce observations, but information is not free: action selection trades off current vs. future hand winnings/losses, sacrifice vs. gain
  • Slide 12
  • Methodology | Active Sensing: Knowledge representation: a set of Dirichlet probability distributions, maintained with a frequency-counting approach. Opponent state s_o = the opponent's estimated hand strength; observed opponent action a_o. The opponent state is calculated at the end of the hand if cards are revealed; otherwise the observation is spread with weight 1/|S| over all possible opponent hands
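A minimal sketch of this frequency-counting representation follows; the number of hand-strength buckets and the symmetric prior `alpha` are assumptions (the slide gives neither), and the 1/|S| spreading for unrevealed hands follows the reading above:

```python
class DirichletOpponentModel:
    """One Dirichlet over actions per opponent state (hand-strength
    bucket), maintained by frequency counting (sketch)."""

    ACTIONS = ("fold", "call", "raise")

    def __init__(self, n_states=10, alpha=1.0):
        self.alpha = alpha   # symmetric Dirichlet prior (assumed)
        self.counts = [{a: 0.0 for a in self.ACTIONS}
                       for _ in range(n_states)]

    def observe(self, action, state=None):
        if state is not None:
            # Cards revealed at showdown: credit the one true state.
            self.counts[state][action] += 1.0
        else:
            # Cards unknown: spread weight 1/|S| across all states.
            w = 1.0 / len(self.counts)
            for c in self.counts:
                c[action] += w

    def predict(self, state):
        """Posterior-mean action probabilities for one opponent state."""
        c = self.counts[state]
        total = sum(c.values()) + self.alpha * len(self.ACTIONS)
        return {a: (c[a] + self.alpha) / total for a in self.ACTIONS}
```

With no data, `predict` returns the uniform prior; each observation shifts the posterior mean toward the observed frequencies.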
  • Slide 13
  • Methodology | BoU: Problem: different strategies may only be effective against certain opponents. Example: Doyle Brunson won two WSOP Main Events with 10-2 off-suit, a very weak starting hand. Example: an aggressive strategy is detrimental when the opponent knows you are aggressive. Solution: choose the correct strategy based on previous sessions
  • Slide 14
  • Methodology | BoU: Approach: find the Boundary of Use (BoU) for the strategies based on previously collected sessions. The BoU partitions sessions into three types of regions (successful, unsuccessful, mixed) based on session outcome. Session outcome is complex and independent of strategy. Choose the correct strategy for new hands based on region membership
  • Slide 15
  • Methodology | BoU: BoU example. Ideal: all sessions inside the BoU. [Figure: sessions plotted against the boundary, with regions where the strategy is correct, incorrect, or unknown]
  • Slide 16
  • Methodology | BoU: Implementation: k-Medoids (semi-supervised) clustering. The similarity metric is modified to incorporate action sequences AND missing values. The number of clusters is found automatically by balancing cluster purity and coverage. Session outcome: uses hand strength to compute the correct decision. Model updates: adjust the intervals for tactics based on sessions found in mixed regions
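The region-membership step can be sketched as below: a distance over action sequences that tolerates missing values, and nearest-medoid assignment to a labeled BoU region. The missing-value penalty and the specific metric are assumptions in the spirit of the slide, not the authors' actual similarity measure:

```python
def distance(a, b):
    """Distance between two action sequences that may contain None
    (missing values); missing or unmatched entries add a fixed penalty."""
    penalty = 1.0   # cost per missing/unaligned entry (assumed)
    d = 0.0
    for x, y in zip(a, b):
        if x is None or y is None:
            d += penalty
        else:
            d += abs(x - y)
    d += penalty * abs(len(a) - len(b))   # length mismatch
    return d

def region_of(session, medoids):
    """medoids: list of (sequence, region_label) from clustering.
    The nearest medoid decides which BoU region (successful /
    unsuccessful / mixed) a new session falls into."""
    return min(medoids, key=lambda m: distance(session, m[0]))[1]
```

Given the region, the agent would keep its strategy in successful regions and switch (or update tactic intervals) in mixed ones, per the slide.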
  • Slide 17
  • Results | Overview: Validation (presented previously): basic agents vs. other basic agents; RL agent vs. basic agents; Deceptive agent vs. RL agent. Investigation: AS agent vs. RL/Deceptive agents; BoU agent vs. RL/Deceptive agents; AS agent vs. BoU agent; ultimate showdown
  • Slide 18
  • Results | Overview: Hypotheses (research and operational):

    | Hypo. | Summary | Validated? | Section |
    |-------|---------|------------|---------|
    | R1 | AS agents will outperform non-AS... | ??? | 5.2.1 |
    | R2 | Changing the rate of exploration in AS will... | ??? | 5.2.1 |
    | R3 | Using the BoU to choose the correct strategy... | ??? | 5.2.3 |
    | O1 | None of the basic strategies dominates | ??? | 5.1.1 |
    | O2 | RL approach will outperform basic... and Deceptive will be somewhere in the middle... | ??? | 5.1.2-3 |
    | O3 | AS and BoU will outperform RL | ??? | 5.2.1-2 |
    | O4 | Deceptive will lead for the first part of games... | ??? | 5.2.1-2 |
    | O5 | AS will outperform BoU when BoU does not have any data on AS | ??? | 5.2.3 |
  • Slide 19
  • Results | RL Validation: Matchup 1: RL vs. Aggressive

    | HS | Fold | Call | Raise |
    |----|------|------|-------|
    | 1 | 0.1013 | 0.8607 | 0.0380 |
    | 2 | 0.3005 | 0.6568 | 0.0427 |
    | 3 | 0.2841 | 0.6815 | 0.0344 |
    | 4 | 0.3542 | 0.5064 | 0.1393 |
    | 5 | 0.1827 | 0.6828 | 0.1345 |
    | 6 | 0.1727 | 0.6857 | 0.1417 |
    | 7 | 0.0530 | 0.8848 | 0.0622 |
    | 8 | 0.0084 | 0.9784 | 0.0133 |
    | 9 | 0.0012 | 0.1130 | 0.8858 |
    | 10 | 0.0003 | 0.0715 | 0.9281 |
  • Slide 20
  • Results | RL Validation: Matchup 2: RL vs. Optimistic

    | HS | Fold | Call | Raise |
    |----|------|------|-------|
    | 1 | 0.1749 | 0.7913 | 0.0338 |
    | 2 | 0.1565 | 0.8051 | 0.0384 |
    | 3 | 0.3565 | 0.5729 | 0.0706 |
    | 4 | 0.3270 | 0.4298 | 0.2432 |
    | 5 | 0.2252 | 0.5288 | 0.2460 |
    | 6 | 0.1460 | 0.4698 | 0.3841 |
    | 7 | 0.0502 | 0.6198 | 0.3300 |
    | 8 | 0.0185 | 0.9632 | 0.0183 |
    | 9 | 0.0187 | 0.8862 | 0.0951 |
    | 10 | 0.0025 | 0.2616 | 0.7359 |
  • Slide 21
  • Results | RL Validation: Matchup 3: RL vs. Conservative

    | HS | Fold | Call | Raise |
    |----|------|------|-------|
    | 1 | 0.2460 | 0.6115 | 0.1425 |
    | 2 | 0.1944 | 0.6824 | 0.1231 |
    | 3 | 0.1797 | 0.6426 | 0.1778 |
    | 4 | 0.1355 | 0.3479 | 0.5166 |
    | 5 | 0.1616 | 0.4245 | 0.4139 |
    | 6 | 0.1236 | 0.2571 | 0.6193 |
    | 7 | 0.1290 | 0.6279 | 0.2431 |
    | 8 | 0.0652 | 0.7893 | 0.1455 |
    | 9 | 0.0429 | 0.5842 | 0.3729 |
    | 10 | 0.0090 | 0.4973 | 0.4937 |
  • Slide 22
  • Results | RL Validation: Matchup 4: RL vs. Deceptive

    | HS | Fold | Call | Raise |
    |----|------|------|-------|
    | 1 | 0.4108 | 0.5734 | 0.0158 |
    | 2 | 0.1835 | 0.7104 | 0.1062 |
    | 3 | 0.0849 | 0.8385 | 0.0766 |
    | 4 | 0.2641 | 0.5450 | 0.1909 |
    | 5 | 0.1207 | 0.5989 | 0.2804 |
    | 6 | 0.0799 | 0.5297 | 0.3903 |
    | 7 | 0.0846 | 0.8401 | 0.0752 |
    | 8 | 0.0266 | 0.9419 | 0.0315 |
    | 9 | 0.0413 | 0.8782 | 0.0805 |
    | 10 | 0.0167 | 0.4684 | 0.5149 |
  • Slide 23
  • Results | AS Results: All opponent modeling approaches defeat RL; explicit modeling is better than implicit. AS with ε = 0.2 improves over non-AS due to the additional sensing. AS with ε = 0.4 senses too much, resulting in too many lost hands
  • Slide 24
  • Results | AS Results: All opponent modeling approaches defeat Deceptive; they can handle its concept shift. AS with ε = 0.2 is similar to non-AS: little benefit from the extra sensing. Again, AS with ε = 0.4 senses too much
  • Slide 25
  • Results | AS Results: AS with ε = 0.2 defeats non-AS: active sensing provides a better opponent model, which overcomes the additional costs. Again, AS with ε = 0.4 senses too much
  • Slide 26
  • Results | AS Results: Conclusions: Mixed results for Hypothesis R1: AS with ε = 0.2 is better than non-AS against RL and heads-up, but AS with ε = 0.4 is always worse than non-AS. Hypothesis R2 is confirmed: ε = 0.4 results in too much sensing, which leads to more losses in hands where the agent should have folded; the extra sensing benefit is not enough to offset the costs
  • Slide 27
  • Results | BoU Results: BoU is crushed by RL: BoU constantly lowers the interval for Aggressive, and RL learns to be super-tight
  • Slide 28
  • Results | BoU Results: BoU is very close to Deceptive: both use aggressive strategies, but BoU's aggressive play is much more reckless after model updates
  • Slide 29
  • Results | BoU Results: Conclusions: Hypotheses R3 and O3 do not hold: BoU does not outperform Deceptive/RL. The model-update method adjusts the Aggressive strategy to fix mixed regions, which results in emergent behavior: reckless bluffing. Bluffing is very bad against a super-tight player.

    | HS | Fold | Call | Raise |
    |----|------|------|-------|
    | 1 | 0.202033 | 0.464872 | 0.333095 |
    | 2 | 0.035130 | 0.929741 | 0.035130 |
    | 3 | 0.082822 | 0.857834 | 0.059344 |
    | 4 | 0.290178 | 0.547892 | 0.161930 |
    | 5 | 0.032236 | 0.149590 | 0.818175 |
    | 6 | 0.025462 | 0.463111 | 0.511426 |
    | 7 | 0.026112 | 0.300444 | 0.673444 |
    | 8 | 0.009666 | 0.913204 | 0.077130 |
    | 9 | 0.003593 | 0.924241 | 0.072166 |
    | 10 | 0.148027 | 0.851838 | 0.000135 |
  • Slide 30
  • Results | Ultimate Showdown: And the winner is... active sensing (booo)

    | HS | Fold | Call | Raise |
    |----|------|------|-------|
    | 1 | 0.0278 | 0.8611 | 0.1111 |
    | 2 | 0.2261 | 0.5304 | 0.2435 |
    | 3 | 0.0145 | 0.8261 | 0.1594 |
    | 4 | 0.0106 | 0.7660 | 0.2234 |
    | 5 | 0.0086 | 0.6552 | 0.3362 |
    | 6 | 0.0103 | 0.6804 | 0.3093 |
    | 7 | 0.1930 | 0.4891 | 0.3179 |
    | 8 | 0.0286 | 0.6571 | 0.3143 |
    | 9 | 0.0233 | 0.5116 | 0.4651 |
    | 10 | 0.0213 | 0.5106 | 0.4681 |
  • Slide 31
  • Conclusion | Summary: AS > RL > Aggressive > Deceptive >= BoU > Optimistic > Conservative

    | Hypo. | Summary | Validated? | Section |
    |-------|---------|------------|---------|
    | R1 | AS agents will outperform non-AS... | Yes | 5.2.1 |
    | R2 | Changing the rate of exploration in AS will... | Yes | 5.2.1 |
    | R3 | Using the BoU to choose the correct strategy... | No | 5.2.3 |
    | O1 | None of the basic strategies dominates | No | 5.1.1 |
    | O2 | RL approach will outperform basic... and Deceptive will be somewhere in the middle... | Yes | 5.1.2-3 |
    | O3 | AS and BoU will outperform RL | Yes | 5.2.1-2 |
    | O4 | Deceptive will lead for the first part of games... | No | 5.2.1-2 |
    | O5 | AS will outperform BoU when BoU does not have any data on AS | Yes | 5.2.3 |
  • Slide 32
  • Questions?
  • Slide 33
  • References:
    (Daw et al., 2006) N. D. Daw et al. Cortical substrates for exploratory decisions in humans. Nature, 441:876-879, 2006.
    (Economist, 2007) Poker: A big deal. Economist, 2007. Retrieved January 11, 2011, from http://www.economist.com/node/10281315?story_id=10281315.
    (Smith, 2009) Smith, G., Levere, M., and Kurtzman, R. Poker player behavior after big wins and big losses. Management Science, pp. 1547-1555, 2009.
    (WSOP, 2010) 2010 World Series of Poker shatters attendance records. Retrieved January 11, 2011, from http://www.wsop.com/news/2010/Jul/2962/2010-WORLD-SERIES-OF-POKER-SHATTERS-ATTENDANCE-RECORD.html.