
University of Manchester

School of Computer Science

Project Report 2019

Learning strategies in large imperfect-information games

Author: Paul Chelarescu

Supervisor: Dr. Jon Shapiro

Abstract


The main purpose of this project is to replicate the abilities of the first poker agents that were recently able to beat top human players at the game of No-Limit Texas Hold'em. These results are significant for the artificial intelligence community: poker is the first such game in which bluffing is a necessary technique for winning, and the agents learned it remarkably well. The agents combine several techniques, and this project incrementally builds upon the core concepts and algorithms they employ, replicating as many of the most important components as the limited time allowed. The evaluation uses exploitability, a performance measure defined for this class of games. Various visualizations of the performance of the algorithms used to learn strategies are presented.


Chapter 1

Introduction

The ﬁeld of Artiﬁcial Intelligence has experienced considerable advancements over the recent

decades, and in many cases the progress has been measured by performance that exceeds that

of humans in games such as checkers, chess and Go. The success in these games is based on

the assumption of perfect information, or information symmetry, where all players have the

same information about the current state of the game. Since determining the optimal strategy

only requires knowledge of the current state, this fact has been leveraged by nearly every AI

developed for perfect-information games, including AIs that defeated top humans in chess and

Go [28]. This property had a direct implication for the methods being used to ﬁnd solutions

in these games, such as minimax search with alpha-beta pruning [20] and Monte Carlo Tree

Search with policy evaluation [24]. Up to 2015, every non-trivial game that had been solved is

a perfect-information game [28].

Despite these successes, and unlike parlor games, decision making in the real world can

rarely assume settings with perfect information. Much more realistic interactions between

multiple agents include the lack of complete knowledge of past events and blufﬁng with in-

formation held only privately. Dealing with hidden information requires drastically different

AI techniques. In 2015, it was announced that Heads-Up Limit Texas Hold'em, the smallest variant of poker that is routinely played by humans, had been weakly solved [28]. Then, in 2017, for the first time, two poker AIs called DeepStack [22] and Libratus [14] [11], which had been developed independently and concurrently, defeated professional human poker players with statistical significance at Heads-Up No-Limit Texas Hold'em, the most popular form of two-player poker, which is several orders of magnitude larger than its limit variant.

The goal of this project is to understand and implement as many of the methods used in these poker AIs as possible, but on simplified games. I have implemented the Counterfactual Regret Minimization (CFR) algorithm, along with the variations CFR+ [27] and LCFR [13], applied them to search for solutions in the games of Rock-Paper-Scissors, Coin Toss, Kuhn Poker and Leduc Hold'em, and compared their performance.

1.1 Background

Imperfect information games model sequential interactions between multiple agents where pri-

vate information is involved. Poker is the quintessential game of imperfect information, the

ﬁrst of this kind used as an example in the game theory ﬁeld [6], and arguably the most popular

card game in the world with millions of players worldwide. In poker, players have knowledge

about the current state of the game that is not shared with their opponents. Therefore, a player has to act under uncertainty, but can also take advantage of their opponent's uncertainty. In poker, each player is dealt private cards (called a hand) and wagers on having the strongest hand, possibly bluffing, calling the opponent's bets or folding to give up their hand. Poker captures the challenge of inferring the decisions of the opponent without revealing our own intentions. Compared to perfect-information games, where all the information needed to infer a strategy is always present on the board, poker requires more complex reasoning. For example, the first player to act does not know which cards the other player has been dealt and therefore has to behave the same regardless of their opponent's private information. As the game progresses, more of each player's private information is revealed through their actions and is reflected in the other player's beliefs. The optimal decision in a particular state will, therefore, depend upon the probability distribution over private information that each player holds over the other. This is akin to recursive reasoning, and algorithms which tackle the game of poker have to handle information asymmetry. This is the key reason why algorithms that are competitive in games of perfect information adapt poorly to games of imperfect information.

For John von Neumann, one of the founders of modern game theory, poker was a compelling representation of how he viewed games in real life: more than just a well-defined form of computation, with tactics of bluffing and deception, and without precise solutions [1]. One particular challenge for him was that probability theory alone is not enough to play poker competitively, which inspired his desire to formalize the idea of "bluffing" [6]. Such an idea is captured in the following way: in order to construct an effective strategy in poker, the amount currently in the pot as well as the beliefs of both players need to be taken into consideration at every decision point.

1.2 The Game

Heads-up no-limit Texas Hold'em (HUNL) is a two-player version of the game of poker, which has become a primary benchmark for imperfect-information games [13]. In HUNL each player is dealt two face-down cards, called the hand, from a standard 52-card deck. The first round is called the pre-flop, and each player starts with a stack of a fixed amount of chips. The game proceeds with rounds of betting as additional cards are dealt face-up in three subsequent rounds: the flop, turn and river. Both players wager on having the strongest hand at the end of the game. The strength of a hand depends on matching the cards on the table with the ones held privately. There is no limit placed on the size of the bets, which radically increases the number of possible actions at each decision point. Instead of being restricted to a few betting amounts, players are allowed to bet as much as they want and can even go all-in, wagering every remaining chip. At the end of the rounds, unless one player has folded their hand and forfeited the chips they wagered to their opponent, there is a showdown, and the strongest hand wins the amount in the pot. Players are then dealt another set of hands, and the game is repeated. The goal of each player is to win as many chips as possible from the other player over a certain number of hands played.

The games investigated in this project are simplified versions of HUNL that retain the same mechanical complexity, with the key differences being a reduction in the number of rounds and in the size of the deck of cards being used. Performance is conventionally measured in milli-big-blinds per game (mbb/g), a unit that is agnostic to the monetary amounts being wagered.
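As a small illustration of the mbb/g unit, the sketch below converts total winnings into milli-big-blinds per game. The function name `mbb_per_game` and the example numbers are hypothetical and chosen only for illustration; the only assumption is the usual reading of the unit, namely one thousandth of a big blind won per game on average.

```python
def mbb_per_game(chips_won: float, big_blind: float, games: int) -> float:
    """Average winnings expressed in milli-big-blinds per game (mbb/g)."""
    if games <= 0 or big_blind <= 0:
        raise ValueError("games and big_blind must be positive")
    # Winnings in big blinds, scaled by 1000 and averaged over the games played.
    return 1000.0 * chips_won / big_blind / games


# Hypothetical session: 1500 chips won over 300 games with a big blind of 100.
print(mbb_per_game(chips_won=1500, big_blind=100, games=300))  # 50.0
```

Because the result is normalised by the big blind, the same figure can be compared across matches played at different stakes.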

AI techniques have shown superhuman performance in games with very large search spaces such as Go (with $10^{170}$ game states), Shogi and Chess [8], in Atari video games (learning directly from high-dimensional sensory inputs) [21] and in real-time strategy games [5]. These achievements require capabilities such as scalability, robustness to exploitation, versatility and learning without human supervision. Besides being a testbed where these capabilities can be challenged, poker is a very well grounded problem in theoretical analysis [6]. Furthermore, poker is a massive game on its own, with the most popular variant, Heads-Up No-Limit Texas Hold'em, having a game tree of size in the order of $10^{160}$, with the added complexity of hidden information and stochastic outcomes [22].

1.3 Implication

The main representation used within computations for poker is the extensive-form sequential game, a powerful mathematical structure which can also capture interactions such as security, negotiations, business, medical treatments, strategic pricing, military applications and many more. This makes the methods employed in solving poker transferable. Due to their generality and minimal domain knowledge requirements, they can be applied to any other field as long as there exists an extensive-form representation that faithfully translates the interactions involved. Once this is achieved, an equilibrium-finding algorithm can handle the search for an optimal solution in the game.

1.4 Solving Poker Games

Before 2006, general-purpose linear programming was used to solve small variants of poker [23], but since then two families of dramatically more scalable equilibrium-finding algorithms and problem representations have been developed for two-player zero-sum games. These algorithms are gradient descent with a decomposed problem representation and counterfactual regret minimization (CFR), the latter of which can be compared to a form of self-play similar to reinforcement learning using an advantage function [9]. Poker games can be naturally modelled as extensive games, with abstraction commonly employed to reduce the size of the game tree [10]. Abstraction was used both explicitly in Libratus and implicitly in DeepStack in order to achieve superhuman performance.

Rational play in such a game is defined by a Nash equilibrium, a profile of strategies, one for each player, which indicates for each information set where it is the player's turn the probability with which to take each of the available actions [23]. In a Nash equilibrium, no player can improve by unilaterally shifting to a different strategy. Imperfect information is represented by information sets, which are collections of game states indistinguishable from each other for the player whose turn it is. Thus, a player has to act the same way on all the nodes in an information set [12]. Exploitability measures how much a strategy loses against its worst-case opponent compared to equilibrium play. In heads-up games, a Nash equilibrium strategy has zero exploitability and is regarded as perfect play. Finding a solution in this context means searching for a strategy which exhibits rational play and has low exploitability.

Extensive form games provide a general model of sequential multiagent interaction in a

compact manner [10]. The intuition behind them relies on the representation of decisions

inside the game as paths on the branches of a tree, where nodes are game states and edges are

valid actions for a player to take at that point. Each non-terminal node has an associated player

taking actions while each terminal node has a utility function assigning payoffs. Information sets are sets of game states which a particular player cannot distinguish between, because of the private information of the other players. A description of the formal model used for computations in

imperfect-information games is presented in the following paragraphs.

1.5 Extensive form games

Two-player zero-sum games have a set of two players $P = \{1, 2\}$. A finite set $H$ represents all possible histories in the game, each a sequence of actions, which can be thought of as nodes in a game tree. $Z \subseteq H$ is the set of terminal histories. For each element $h \in H$ the action function $A$ defines the set of available actions of that node as $A(h)$. When an action $a$ leads from a node $h$ to $h'$ this is written as $h \cdot a = h'$. $P(h) \in P \cup \{c\}$ defines the player which acts on that node, with $c$ representing chance. When referring to a player $p \in P$, their opponent is denoted $-p$. Every node $z \in Z$ has an associated utility function $u(z)$ which assigns a payoff to each player. Zero-sum games have the property that $u_p(z) = -u_{-p}(z)$ for all $z \in Z$.

Imperfect information is represented by information sets, a partition of $H$ for each player $p \in P$ such that all nodes $h$ and $h'$ from the same information set $I_p$ are indistinguishable from each other to $p$. Therefore, histories in information set $I_p$ differ only in the private information of player $-p$.

Each acting player $p$ has an associated strategy $\sigma_p$, which assigns a probability vector over the available actions to each information set $I$. The probability of choosing a specific action $a$ when in $I$ is given by $\sigma(I,a)$. A strategy profile $\sigma$ is a tuple $(\sigma_p, \sigma_{-p})$ and the associated utility $u_p(\sigma_p, \sigma_{-p})$ represents the expected payoff for $p$ if both players follow that profile. Let the probability of reaching history $h$ if both players follow $\sigma$ be denoted by $\pi^\sigma(h)$, and let $\pi^\sigma_p(h)$ be the contribution of $p$ to this probability (the probability of reaching $h$ if chance and $-p$ play deterministically towards $h$ while $p$ plays according to $\sigma_p$). Similarly, let $\pi^\sigma_{-p}(h)$ be the contribution of chance and $-p$, and define $\pi^\sigma(h,h')$ as the probability of reaching $h'$ given that $h$ has already been reached. From this, it follows that if both players act according to $\sigma$ then the probability of reaching $I$ is $\pi^\sigma(I) = \sum_{h \in I} \pi^\sigma(h)$, and that the contribution of $p$ alone to reaching $I$ is $\pi^\sigma_p(I) = \sum_{h \in I} \pi^\sigma_p(h)$.
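The decomposition of a reach probability into per-actor contributions can be sketched in a few lines. This is only an illustration under assumptions: the representation of a history as a list of `(actor, probability)` pairs, the function name `reach_probabilities` and the example numbers are hypothetical and not taken from the project's code.

```python
from math import prod


def reach_probabilities(path):
    """Split the probability of reaching a history into per-actor contributions.

    path is a list of (actor, probability) pairs, one per action on the way to
    the history; actor is 1, 2 or "c" for chance.
    """
    total = prod(p for _, p in path)
    by_actor = {actor: prod(p for a, p in path if a == actor)
                for actor in (1, 2, "c")}
    return total, by_actor


# Chance deals the cards (1/6), player 1 bets with probability 0.3,
# player 2 then bets with probability 0.5.
path = [("c", 1 / 6), (1, 0.3), (2, 0.5)]
total, by_actor = reach_probabilities(path)

# Contribution of everyone except player 1 (chance and the opponent).
pi_minus_1 = by_actor[2] * by_actor["c"]
print(total, by_actor[1], pi_minus_1)
```

The product of a player's own contribution and the contribution of everyone else recovers the full reach probability, which is the property the counterfactual quantities in the following sections rely on.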

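The expected utility of player $p$ given history $h$ and assuming both players follow $\sigma$ is denoted by $u^\sigma_p(h) = \sum_{z \in Z} \left( \pi^\sigma(h,z)\, u_p(z) \right)$. A best response to $\sigma_p$ is a strategy $BR(\sigma_p)$ of the opponent such that

$$u_{-p}(\sigma_p, BR(\sigma_p)) = \max_{\sigma'_{-p}} u_{-p}(\sigma_p, \sigma'_{-p}).$$

A Nash equilibrium is a strategy profile $\sigma^*$ where no single player can improve their utility by changing strategies, so that every player plays a best response. Formally, $\sigma^*$ satisfies

$$\forall p, \quad u_p(\sigma^*_p, \sigma^*_{-p}) = \max_{\sigma'_p \in \Sigma_p} u_p(\sigma'_p, \sigma^*_{-p}),$$

where $\Sigma_p$ is the set of strategies available to $p$. A strategy profile $\sigma^*$ is an $\varepsilon$-Nash equilibrium if the best counter-strategy can beat it by at most $\varepsilon$, formally written as

$$\forall p, \quad u_p(\sigma^*_p, \sigma^*_{-p}) + \varepsilon \geq \max_{\sigma'_p \in \Sigma_p} u_p(\sigma'_p, \sigma^*_{-p}).$$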

In a two-player zero-sum game the exploitability of a strategy is measured by how much worse it performs against a best response compared to a Nash equilibrium strategy. Formally, the exploitability of strategy $\sigma_p$ is written as $u_p(\sigma^*) - u_p(\sigma_p, BR(\sigma_p))$.

1.6 Counterfactual Regret Minimization

Regret is an online learning concept that indicates how much better a player would have done had it always played a specific action at a specific decision point, instead of playing according to its own strategy. Counterfactual regret minimization (CFR) is an iterative algorithm which at every iteration $t \in \{1,\dots,T\}$ computes a strategy $\sigma^t_p$ for all players $p \in P$, and whose average strategy $\bar{\sigma}^T_p$ converges to an approximate Nash equilibrium as $T \to \infty$ in two-player zero-sum games [10]. The counterfactual value for player $p$ in a node $h$ when players adopt the strategy $\sigma$ is

$$v^\sigma_p(h) = \sum_{z \in Z} \left( \pi^\sigma(h,z)\, u_p(z) \right)$$

The counterfactual value of an information set is the weighted average of the values of the nodes inside it. A node is given weight based on the belief of a player being in that node. Formally, the counterfactual value of an information set is given by

$$v^\sigma_p(I) = \frac{\sum_{h \in I} \left( \pi^\sigma_{-p}(h)\, v^\sigma_p(h) \right)}{\sum_{h \in I} \pi^\sigma_{-p}(h)}$$

while an action-dependent counterfactual value has the form

$$v^\sigma_p(I,a) = \frac{\sum_{h \in I} \left( \pi^\sigma_{-p}(h)\, v^\sigma_p(h \cdot a) \right)}{\sum_{h \in I} \pi^\sigma_{-p}(h)}$$

The instantaneous regret $r^t(I,a)$ of player $p$ for action $a \in A(I)$ on iteration $t$ is how much better off $p$ would have been had it always chosen action $a$ in $I$ and played to get to $I$, but played according to $\sigma^t$ thereafter. It is formally defined as

$$r^t(I,a) = \pi^{\sigma^t}_{-p}(I) \left( v^{\sigma^t}_p(I,a) - v^{\sigma^t}_p(I) \right)$$

The overall regret on iteration $T$ represents the regret accumulated so far and is defined as

$$R^T(I,a) = \sum_{t=1}^{T} r^t(I,a)$$

Regret matching is an algorithm for choosing actions in the future in proportion to positive regret, or how much better the outcome could have been. CFR determines the strategy in one iteration by applying regret matching to each information set [3] [7]. Formally, on each iteration $t + 1$, player $p$ chooses action $a$ based on

$$\sigma^{t+1}(I,a) = \begin{cases} \dfrac{R^t_+(I,a)}{\sum_{\tilde{a} \in A(I)} R^t_+(I,\tilde{a})} & \text{if } \sum_{\tilde{a}} R^t_+(I,\tilde{a}) > 0 \\[2ex] \dfrac{1}{|A(I)|} & \text{otherwise} \end{cases}$$

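where $x_+ = \max(x, 0)$. The average strategy for an information set $I$ on iteration $T$ is defined as

$$\bar{\sigma}^T_p(I) = \frac{\sum_{t=1}^{T} \left( \pi^{\sigma^t}_p(I)\, \sigma^t_p(I) \right)}{\sum_{t=1}^{T} \pi^{\sigma^t}_p(I)}$$

and has the property of forming an $\varepsilon$-Nash equilibrium as $T \to \infty$. In two-player zero-sum games, if both players have average regret $R^T_p / T \leq \varepsilon$, where $R^T_p$ denotes the total regret of player $p$, then their average strategies $(\bar{\sigma}^T_p, \bar{\sigma}^T_{-p})$ form a $2\varepsilon$-Nash equilibrium. Thus, CFR constitutes an anytime algorithm for finding an $\varepsilon$-Nash equilibrium in two-player zero-sum games.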
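The regret matching rule described in this section can be sketched as a single function acting on the cumulative regrets of one information set. The name `regret_matching` and the example regrets are illustrative assumptions, not the project's own code.

```python
def regret_matching(cumulative_regret):
    """Turn the cumulative regrets of one information set into a strategy."""
    positive = [max(r, 0.0) for r in cumulative_regret]
    total = sum(positive)
    if total > 0:
        # Play each action in proportion to its positive regret.
        return [r / total for r in positive]
    # No action has positive regret: fall back to the uniform strategy.
    return [1.0 / len(cumulative_regret)] * len(cumulative_regret)


print(regret_matching([2.0, -1.0, 6.0]))   # [0.25, 0.0, 0.75]
print(regret_matching([-1.0, -2.0]))       # [0.5, 0.5]
```

Actions with negative cumulative regret receive zero probability, while the fallback branch corresponds to the "otherwise" case of the rule.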

1.7 Advancements of CFR

There are many improvements over the vanilla CFR algorithm that have been shown to achieve much faster convergence. CFR+ is a variation which changes the overall regret and average strategy equations in order to converge an order of magnitude faster [27]. First, instead of simultaneous updates for both players, CFR+ updates the players alternately, and it weights the contribution of iteration $t$ to the average strategy $\bar{\sigma}$ by $t$. Secondly, after each iteration, any action whose overall regret is negative has that regret reset to zero. Formally, CFR+ replaces the overall regret $R^T(I,a)$ with

$$Q^T(I,a) = \max\left\{ 0,\; Q^{T-1}(I,a) + r^T(I,a) \right\}$$

Despite its advantages, CFR+ is sensitive to variance and is therefore difficult to use with sampling and function approximation [4]. Linear CFR (LCFR) is a form of Discounted Regret Minimization which is identical to vanilla CFR, except that on iteration $t$ the updates to the instantaneous regrets $r^t$ and average strategies $\bar{\sigma}$ are given weight $t$. Formally, in LCFR, the regret is defined as

$$R^T(I,a) = \sum_{t=1}^{T} \left( t\, r^t(I,a) \right)$$

while the average strategy is defined as

$$\bar{\sigma}^T_p(I) = \frac{\sum_{t=1}^{T} \left( t\, \pi^{\sigma^t}_p(I)\, \sigma^t_p(I) \right)}{\sum_{t=1}^{T} \left( t\, \pi^{\sigma^t}_p(I) \right)}$$

CFR variants have focused on improvements that allow them to ﬁnd solutions in ever bigger

games. CFR+ was used to near-optimally solve Heads-up Limit Texas Hold’em, a version

of poker with 10

unique decision points [28]. DeepStack was the ﬁrst agent able to beat

top human poker players at the game of Heads-up No-Limit Texas Hold’em, the full version

with $10^{160}$ decision points, by utilizing a hybrid of vanilla CFR and CFR+. In DeepStack, the regret-minimizing algorithm uses uniform weighting and simultaneous updates like CFR, but computes the final average strategy and average counterfactual values by omitting early

iterations [22]. CFR+ was also used to approximately solve HUNL endgames in Libratus, a

poker agent which defeated HUNL top professionals [14] [12]. LCFR is faster than CFR+ in

certain settings with wide distributions in payoffs and tolerates errors far better [13], leading to

drastically faster convergence without increasing the computation cost per iteration.
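The difference between the three regret accumulators discussed in this section can be made concrete on a toy sequence of instantaneous regrets for a single action. The function names and the regret values below are hypothetical; the sketch only mirrors the update rules stated in the text and ignores the average-strategy weighting.

```python
def cfr_update(total, regret, t):
    """Vanilla CFR: plain running sum of instantaneous regrets."""
    return total + regret


def cfr_plus_update(total, regret, t):
    """CFR+: the accumulated regret is floored at zero after every iteration."""
    return max(0.0, total + regret)


def lcfr_update(total, regret, t):
    """LCFR: the instantaneous regret of iteration t is weighted by t."""
    return total + t * regret


def accumulate(update, instantaneous):
    total = 0.0
    for t, regret in enumerate(instantaneous, start=1):
        total = update(total, regret, t)
    return total


instantaneous = [-3.0, 1.0, 2.0]  # regrets on iterations 1, 2 and 3
print(accumulate(cfr_update, instantaneous))       # 0.0
print(accumulate(cfr_plus_update, instantaneous))  # 3.0
print(accumulate(lcfr_update, instantaneous))      # 5.0
```

In this example the early negative regret keeps the vanilla accumulator at zero, CFR+ forgets it immediately, and LCFR discounts it relative to the later iterations, so both variants start favouring the action sooner.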

1.8 Superhuman performance in Poker Games

The most notable advancement in the recent past is that algorithms which are able to compete

with humans at HUNL employ strategies computed online during play instead of being re-

trieved from an ofﬂine pre-computation. Such methods, which are called real-time solving for

Libratus and re-solving for DeepStack, have been developed independently and concurrently.

The similarity between them is that they both solve nested subgames of poker as they appear

during play. DeepStack solves a depth-limited subgame during the betting rounds by estimat-

ing the opponent counterfactual values at the depth limit with a deep neural network. Libratus

plays according to a detailed pre-computed strategy at the beginning of the game and rounds

off unexpected opponent actions to a nearby abstraction. When the game is nearing the end, it

then creates a much more detailed abstraction of the subgame in real-time and solves it taking

into consideration the broader context of the original abstraction.

The common challenge for both DeepStack and Libratus is incorporating the online com-

putations of the current subgame within the broader game. Solving subgames in imperfect-

information games cannot be done in isolation because their optimal strategy may depend on

other, unreached subgames, that are related to the current game state. Another challenge that

both agents had to overcome is considering how the opponent can adapt to changes in the

strategies. For example, always betting with good hands is a bad strategy since the opponent can learn to fold to our bets. Another example is that when abstractions are incorporated, players

can exploit them and make strategies which play actions that are the most overlooked by the

abstraction. Libratus incorporates a self-improvement module which enhances its abstraction

as players search for holes in the strategy by computing game-theoretic strategies for miss-

ing branches in the abstraction. Finally, states in imperfect-information games do not have a

single well-deﬁned value, but depend on aspects such as how players have reached the state.

DeepStack learns a counterfactual value function with deep neural networks that approximates

outcomes in a state using players’ ranges, which are probability distributions over possible

hands given that the public state is reached. Furthermore, rather than a single value, the deep

counterfactual value network outputs a vector of counterfactual values $v_j$, where $j$ iterates over each possible hand combination.

Chapter 2

Development

This project investigates the empirical performance of various CFR and Regret Matching algo-

rithms on simpliﬁed imperfect-information games.

2.1 Rock Paper Scissors

Rock Paper Scissors (RPS) is a two-player zero-sum game where each player simultaneously makes one of three choices: rock (R), paper (P) or scissors (S). Each choice gives the opportunity to win, lose or draw against the opponent. Action R beats S, which in turn beats P, which in turn beats R. Every choice has a counter-choice, making the game a paradigm for studying competition caused by cyclical dominance. The goal of the game is to beat the opponent's strategy.

Formally, cyclical dominance can be expressed quantitatively with a payoff matrix. The game has a set of two players $P = \{P_1, P_2\}$, a finite set of actions for each player $p \in P$ called $S_p$, a set of possible combinations of actions $A = S_1 \times \dots \times S_n$ (here $n = 2$) and a function $u$ which maps each action combination to a vector of utilities for each player. Winning has a utility of 1, losing has a utility of -1 and a draw has a utility of 0. RPS can be represented as a normal-form game, an $n$-dimensional table in which each dimension (the rows and the columns in the two-player case) corresponds to a single player's actions, and each entry in the table is a vector of payoffs for each player. The payoff table for RPS is as follows:

          R         P         S
R       0, 0     -1, 1     1, -1
P      1, -1      0, 0     -1, 1
S      -1, 1     1, -1      0, 0

Table 2.1: Rock-Paper-Scissors payoff (the row player's payoff is listed first)

Even though the game is played simultaneously, an alternative representation which is equally valid is to consider that one player acts first and then the second player, who has incomplete information about the state of the game, has to make a choice without knowing the private decision of the first player.

Figure 2.1: Rock-Paper-Scissors shown as an extensive form game

In an iterated game of RPS, if one player always chooses the same action, their opponent can exploit them by learning to always choose the counter-action. If one player only ever chooses between two actions, their opponent can likewise take advantage of that regularity. To avoid exploitation, and to reach a Nash equilibrium, players have to adopt a strategy which distributes their actions evenly across the three options. Such a strategy can be learned through regret matching, one of the key components of learning in imperfect-information games. The main idea is that, for an agent to learn how to play RPS well, it selects actions at random with a distribution that is proportional to positive regret, i.e. how much the agent wishes it had taken that action in the past. Ideally, the agent will minimize its expected regret over time, learn a best response strategy and converge to a correlated equilibrium.
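A minimal self-play sketch of this learning process is shown below. It is an illustration under assumptions rather than the project's implementation: regrets are updated with expected values instead of sampled actions (so the run is deterministic), both players update simultaneously, and the names `RPS`, `regret_matching` and `self_play` as well as the initial bias towards Rock are hypothetical choices.

```python
RPS = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]  # row player's payoff; order R, P, S


def regret_matching(regrets):
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    return [1.0 / len(regrets)] * len(regrets)


def self_play(payoff, iterations):
    n = len(payoff)
    # Utilities from each player's own point of view: utility[player][own][opp].
    utility = [payoff, [[-payoff[a][b] for a in range(n)] for b in range(n)]]
    # A small initial bias towards the first action so that play does not
    # start exactly at the equilibrium.
    regrets = [[1.0] + [0.0] * (n - 1), [0.0] * n]
    strategy_sums = [[0.0] * n, [0.0] * n]
    for _ in range(iterations):
        strategies = [regret_matching(r) for r in regrets]
        for player in (0, 1):
            own, opp = strategies[player], strategies[1 - player]
            values = [sum(utility[player][a][b] * opp[b] for b in range(n))
                      for a in range(n)]
            expected = sum(own[a] * values[a] for a in range(n))
            for a in range(n):
                regrets[player][a] += values[a] - expected
                strategy_sums[player][a] += own[a]
    # The average strategy, not the final one, is what approaches equilibrium.
    return [[s / iterations for s in sums] for sums in strategy_sums]


average = self_play(RPS, 10000)
print([round(p, 2) for p in average[0]])
```

The current strategies keep cycling between the three actions, while the averaged strategies of both players should end up close to the uniform distribution.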

Rock-Paper-Scissors+ (RPS+) [16] is a variation of RPS in which the outcomes are no longer uniform: whenever either player chooses Scissors, the winner's payoff becomes 2 instead of 1, while the loser's payoff becomes -2. Quite contrary to intuition, this incentivizes an increased probability of choosing both Rock and Paper in order to maintain the Nash equilibrium. Raising the stakes around Scissors makes a player who chooses it too often more exploitable. In order to remove exploitability, a player wishing to play the Nash equilibrium has to play Rock, Paper and Scissors with probabilities (0.4, 0.4, 0.2). RPS+ was defined by Brown et al. in 2018 in order to exemplify the challenge of depth-limited solving in imperfect-information games.

          R         P         S
R       0, 0     -1, 1     2, -2
P      1, -1      0, 0     -2, 2
S      -2, 2     2, -2      0, 0

Table 2.2: Rock-Paper-Scissors+ is a variation where the payoffs are biased towards Scissors

The next chapter includes experimental results showing the convergence rate of the CFR

algorithm as a function of exploitability, as well as its variations CFR+ and LCFR, on the games

of RPS and RPS+.
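The equilibrium claim for RPS+ can be checked directly from the payoff table. The sketch below computes the value that a best-responding opponent obtains against a fixed strategy; since the game is symmetric and its value is 0, this best-response value equals the exploitability. The names `RPS_PLUS` and `best_response_value` are illustrative assumptions.

```python
RPS_PLUS = [[0, -1, 2], [1, 0, -2], [-2, 2, 0]]  # row player's payoff; R, P, S


def best_response_value(payoff, strategy):
    """Highest expected payoff a column opponent can obtain against strategy."""
    n = len(payoff)
    # The opponent's payoff for column b is the negation of the row player's.
    return max(-sum(strategy[a] * payoff[a][b] for a in range(n))
               for b in range(n))


print(best_response_value(RPS_PLUS, [0.4, 0.4, 0.2]))        # approximately 0
print(best_response_value(RPS_PLUS, [1 / 3, 1 / 3, 1 / 3]))  # approximately 1/3
```

Against (0.4, 0.4, 0.2) every opponent action yields the same payoff of 0, so nothing can be gained, whereas the uniform strategy that is optimal in plain RPS can be exploited by always answering with Rock.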

2.2 Coin Toss

Coin Toss is a simple two-player game that was ﬁrst introduced in 2017 by Brown et al as an

example for why imperfect information subgames cannot be solved in isolation [12]. Subgame

solving is a key component of state of the art poker AIs and this project shows how CFR can be

used to learn a strategy in the entire game of Coin Toss. Furthermore, I present the relationship

between the strategy proﬁle learned with CFR and the usage of subgame solving.

Figure 2.2: Coin Toss in extensive form. The dashed line represents $P_2$'s information set, meaning $P_2$ cannot distinguish between the two states. The payoffs are shown only for $P_1$; $P_2$ receives the negation of $P_1$'s payoff.

In Coin Toss, there are two players called $P_1$ and $P_2$ and a coin, which has equal probability of landing Heads or Tails. The coin is flipped and only $P_1$ sees the outcome. Then, $P_1$ must choose between two actions called "Sell" and "Play". The "Sell" action leads to a subgame whose expected value (EV) depends on whether the coin landed Heads or Tails. If the coin lands Heads, $P_1$ receives an EV of 0.5. On the other hand, if the coin lands Tails, $P_1$ receives an EV of -0.5. Should $P_1$ choose the action "Play", then $P_2$ may guess how the coin landed. If $P_2$ guesses correctly, $P_1$ receives an EV of -1; however, if $P_2$ guesses incorrectly, $P_1$ receives an EV of 1. $P_2$ can also choose to forfeit, in which case $P_1$ also receives an EV of 1. The optimal strategy for $P_2$ is to choose Heads with 25% probability and Tails with 75% probability in the "Play" subgame, which gives both players an EV of 0. $P_1$ can always choose "Sell" to receive the same EV, or it could choose any mixed combination of "Play" and "Sell" strategies and it would still receive at best an EV of 0. However, should the roles of the sides of the coin reverse, so will the choices in the strategy of $P_2$. If $P_1$ starts to receive an EV of -0.5 when it sells Heads, and 0.5 when it sells Tails, $P_2$ will have to adapt its strategy by guessing Heads with 75% probability and Tails with 25% probability. This shows that $P_2$'s strategy depends on $P_1$'s outcome in an unrelated subgame that $P_2$ cannot even reach, but which nevertheless influences $P_1$'s own strategy and in turn $P_2$'s decisions. Thus, it is impossible to solve a subgame without using information from other parts of the game. This is the central challenge of imperfect-information games as opposed to perfect-information games.
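The dependence of the "Play" subgame on the "Sell" payoffs can be reproduced numerically. The sketch below is an illustration under assumptions: the forfeit action is left out because it is never better for the guessing player than guessing, the optimum is found with a simple grid search, and the function names are hypothetical.

```python
def game_value(p_guess_heads, sell_heads=0.5, sell_tails=-0.5):
    """Value of the game for P1 when P2 guesses Heads with this probability."""
    p = p_guess_heads
    play_heads = p * -1 + (1 - p) * 1   # coin is Heads: a Heads guess is correct
    play_tails = p * 1 + (1 - p) * -1   # coin is Tails: a Heads guess is wrong
    # P1 sees the coin and picks the better of Sell and Play for each outcome.
    return 0.5 * max(sell_heads, play_heads) + 0.5 * max(sell_tails, play_tails)


def best_guess_probability(sell_heads, sell_tails, steps=100):
    """Grid search for the Heads-guess probability that minimises P1's value."""
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=lambda p: game_value(p, sell_heads, sell_tails))


print(best_guess_probability(0.5, -0.5))   # 0.25 in the original game
print(best_guess_probability(-0.5, 0.5))   # 0.75 once the Sell payoffs swap
```

Nothing inside the "Play" subgame changes between the two calls; only the payoffs of the "Sell" branch do, yet the guessing strategy that holds the first player to a value of 0 flips from 25% to 75% Heads.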

The CFR algorithm, along with its variants explored in this project, visits the entire game tree at every iteration, which sidesteps this limitation. Larger games, such as HUNL, are too big for CFR to visit all their states even once. The classical solution, which is to search for a solution in an abstraction of the entire game, lends itself to high exploitability. Subgame solving can instead be used to fit a new, more detailed solution of a subgame within the overall strategy computed for the abstraction. Such methods were used in both DeepStack and Libratus to win against top human players.

2.3 Kuhn Poker

Kuhn Poker is another simple zero-sum game which is played with a deck of 3 cards: the Jack, the Queen and the King. The game was created by Harold W. Kuhn in 1950 [19]. Two players, $P_1$ and $P_2$, each bet 1 chip, called the ante, into the pot, before they are both dealt one card from the deck. Their card is held as private information. The players take turns in being the first to take an action. On a turn, a player has the option to either pass or bet. When a player bets, they put an additional chip into the pot. If a player passes after their opponent bets, they forfeit their ante and the pot gets awarded to their opponent. However, if there are two successive passes or two successive bets, there is a showdown and the player with the strongest hand (higher card) wins and takes all the chips in the pot.

Figure 2.3: Partial game tree for Kuhn Poker. Payoffs are shown only for $P_1$; $P_2$ receives the opposite of $P_1$'s payoff. The deck of cards is shown as a player which deals a card at random to $P_1$, then to $P_2$.

There are six possible ways in which the cards are dealt to the players, and 12 resulting information sets. The expected value for $P_1$ in the Nash equilibrium for this game is $-1/18$, which means they will lose on average.

Many of the challenges of HUNL, such as bluffing with a weak hand, passing with a good hand and avoiding being exploited by the opponent, are captured in Kuhn Poker. CFR is an algorithm which can be used to search for a solution in Kuhn Poker, where a solution means a strategy profile which is an $\varepsilon$-Nash equilibrium. This project explores the performance differences between CFR, CFR+ and LCFR in learning an optimal strategy in Kuhn Poker.
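A compact sketch of vanilla CFR on Kuhn Poker is given below. It is not the project's implementation: it assumes simultaneous updates, a full traversal of all six deals on every iteration (so the run is deterministic), cards encoded as 1, 2, 3 and histories as strings of `p` (pass) and `b` (bet). The constant chance probability of 1/6 is left out of the reach probabilities because it scales all regrets equally. All names are hypothetical.

```python
from itertools import permutations

ACTIONS = "pb"  # p = pass, b = bet


class Node:
    """Regret and strategy accumulators for one information set."""

    def __init__(self):
        self.regret_sum = [0.0, 0.0]
        self.strategy_sum = [0.0, 0.0]
        self.current = [0.5, 0.5]  # strategy used during the current iteration

    @staticmethod
    def _normalise(values):
        total = sum(values)
        return [v / total for v in values] if total > 0 else [0.5, 0.5]

    def refresh(self):
        # Regret matching: play in proportion to positive cumulative regret.
        self.current = self._normalise([max(r, 0.0) for r in self.regret_sum])

    def average(self):
        return self._normalise(self.strategy_sum)


def terminal_utility(cards, history):
    """Payoff for the player who would act next, or None if play continues."""
    if len(history) < 2:
        return None
    player = len(history) % 2
    higher = cards[player] > cards[1 - player]
    last_two = history[-2:]
    if last_two == "pp":
        return 1 if higher else -1
    if last_two == "bp":
        return 1  # the opponent passed after a bet and forfeits the ante
    if last_two == "bb":
        return 2 if higher else -2
    return None  # "pb": the first player still has to respond


def cfr(nodes, cards, history, reach):
    """One vanilla CFR traversal; returns the value for the player to act."""
    payoff = terminal_utility(cards, history)
    if payoff is not None:
        return payoff
    player = len(history) % 2
    node = nodes.setdefault(f"{cards[player]}{history}", Node())
    strategy = node.current
    values = []
    for i, action in enumerate(ACTIONS):
        child_reach = list(reach)
        child_reach[player] *= strategy[i]
        values.append(-cfr(nodes, cards, history + action, child_reach))
    node_value = sum(s * v for s, v in zip(strategy, values))
    for i in range(2):
        # Regrets are weighted by the opponent's reach, the average by our own.
        node.regret_sum[i] += reach[1 - player] * (values[i] - node_value)
        node.strategy_sum[i] += reach[player] * strategy[i]
    return node_value


def average_value(nodes, cards, history=""):
    """Value for the player to act when both follow their average strategies."""
    payoff = terminal_utility(cards, history)
    if payoff is not None:
        return payoff
    avg = nodes[f"{cards[len(history) % 2]}{history}"].average()
    return sum(p * -average_value(nodes, cards, history + a)
               for p, a in zip(avg, ACTIONS))


def train(iterations):
    nodes = {}
    deals = list(permutations((1, 2, 3), 2))  # Jack = 1, Queen = 2, King = 3
    for _ in range(iterations):
        for cards in deals:
            cfr(nodes, cards, "", [1.0, 1.0])
        for node in nodes.values():
            node.refresh()
    value = sum(average_value(nodes, cards) for cards in deals) / len(deals)
    return nodes, value


nodes, value = train(2000)
print(len(nodes), round(value, 3))
```

The dictionary ends up with one entry per information set, and the value of the averaged strategy profile for the first player should lie close to the equilibrium value of $-1/18 \approx -0.056$. CFR+ and LCFR would differ only in how `regret_sum` and `strategy_sum` are accumulated.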

2.4 Leduc Hold’em

Leduc Hold’em is a variant of Texas Hold’em which is sometimes used in academic con-

texts, ﬁrst introduced in [25]. Leduc Hold’em seeks to retain the strategic elements of Texas

Hold’em, while keeping the size of the game smaller and computing exact solutions tractable.

The game is played with a deck of six cards, comprising two suits of three ranks each: the

Jack, the Queen and the King. There are two rounds of betting in the game. Each player starts

with a stack of a ﬁxed amount of chips, and has to put one chip into the pot as an ante. Then,

they are both being dealt a hand made of one card to keep private. The ﬁrst round of betting

takes place, with players having the option of calling to match their opponent’s bet, raising the

bet amount or folding their hand. The raise amount is limited to 2 and 4. Players must call

their opponent’s raise if they want to proceed to the next round. Then, a board card is revealed,

and there is another round of betting. Finally, if no player folds, there is a showdown, and the

strongest hand wins the pot. The strongest hand is one that pairs the board card, that is, has

the same rank as it. If no player makes a pair with the board card, the strongest hand is the one with

the higher rank.
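The showdown rule can be written as a short function. This is a sketch under stated assumptions: cards are represented by rank only (suits do not affect the outcome), the function name is hypothetical, and a tie is returned when both players hold the same rank and neither pairs the board, a case the description above does not cover.

```python
RANKS = {"J": 1, "Q": 2, "K": 3}  # Jack < Queen < King


def leduc_showdown(card_p1, card_p2, board):
    """Return 1 if P1 wins, -1 if P2 wins, 0 for a tie (assumed split pot)."""
    pair_p1 = card_p1 == board
    pair_p2 = card_p2 == board
    if pair_p1 != pair_p2:
        # Exactly one player pairs the board card and wins outright.
        return 1 if pair_p1 else -1
    # Otherwise the higher private card wins.
    diff = RANKS[card_p1] - RANKS[card_p2]
    return (diff > 0) - (diff < 0)
```

With a Jack on the board, a private Jack beats a private King, while a Queen loses to a King because neither card makes a pair.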

2.5 No-Limit Leduc Hold’em

A simple extension of Leduc Hold’em is to place no limit on the amount of chips which can be

raised by either player. Every time a player wants to re-raise their opponent’s previous bet, they

have to raise at least double that amount. Very quickly, due to the exponential increase in the

raise amounts, each player has to go all-in with their bets, wagering all their remaining chips.

No-Limit Leduc Hold’em is a much larger game than Leduc Hold’em due to the increase in the

number of actions at each decision point, from a small number of betting actions to a virtually

continuous space of possibilities.
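To see how quickly doubling forces an all-in, consider the following simplified sketch. It assumes every re-raise is exactly the minimum (double the previous raise) and caps the final raise at the stack size, ignoring chips already committed to the pot. The stack of 100 chips and the opening raise of 2 are hypothetical values chosen for illustration.

```python
def minimum_raise_sequence(stack, first_raise=2):
    """Minimum raise sizes under the doubling rule, ending with an all-in."""
    sizes = []
    size = first_raise
    while size < stack:
        sizes.append(size)
        size *= 2           # each re-raise must at least double the last one
    sizes.append(stack)     # the next minimum raise exceeds the stack: all-in
    return sizes


print(minimum_raise_sequence(100))  # prints: [2, 4, 8, 16, 32, 64, 100]
```

Under these assumptions a 100-chip stack is exhausted after seven raises, even when every raise is the smallest one allowed.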

2.6 Deep Counterfactual Value Networks

Deep neural networks (DNNs) are computing systems originally inspired by information

processing in the biological neural networks that constitute animal brains. DNNs can model

complex non-linear relationships, and as a result they are responsible for

significant advancements in image and speech recognition [18] and in game playing.

Recently, DeepStack has used DNNs to estimate the value of future game states in HUNL,

using millions of solutions to poker sub-games as training data. This was a key component of the

poker AI, enabling it to learn strategies that converge to a Nash equilibrium in HUNL without

using traditional abstraction techniques.

A common approach for solving large imperfect-information games (on the order of 10^160

states) is to search for a solution in a smaller, abstracted version, and map strategies back to

the original game. Such strategies are highly exploitable by opponents explicitly looking for

sections in the full game tree that are weakly covered by the abstraction. DeepStack is a strong

poker AI that combines ideas from perfect information games with algorithms such as CFR

and endgame solving [2], while remaining theoretically sound [22]. Endgame solving is a

method for ﬁnding a solution with greater accuracy in a speciﬁc subgame encountered during

play that ﬁts within the abstract equilibrium strategies that are already computed for the initial

portion of the game. Endgame solving alone comes with disadvantages: the technique

is intractable unless applied to small game trees, and it requires action translation and handling

of the opponent’s off-tree actions [2]. DeepStack addresses these shortcomings

by continually re-solving the subgame at every action taken, starting from the current state.

Continual re-solving is still not enough to solve a big game such as HUNL, since computing a

solution at the early stages of the game means exploring the entire tree, which is infeasible.

For this reason, DeepStack only computes a strategy down to a limited depth of the tree, and

uses deep neural networks to estimate expected counterfactual values for each hand on future

betting rounds in its re-solving step [17]. Consequently, the deep counterfactual value network

(DCVN) is trained on randomly generated poker situations to predict the output of the

CFR algorithm, namely the counterfactual values for every node in the sub-tree.

Chapter 3

Evaluation

The previous chapter introduces a number of two-player zero-sum games of increasing

complexity. This chapter presents the results of applying learning algorithms to these

games. Strategy profiles are plotted against the number of iterations in the training process. In

small poker games, exploitability is plotted as a function of iterations.

3.1 Best Response

First introduced in Chapter 1, a Best Response (BR) is a strategy that maximizes an

agent’s utility against a given opponent strategy. When every player is playing a BR to the

others, the resulting strategy profile forms a Nash equilibrium. For simple games, such as RPS

and RPS+, BR can be calculated directly from the payoff matrix. In RPS, the equilibrium has

each player choose every action with probability 1/3. RPS+, on the other hand, has a Nash

equilibrium in which players play Scissors with probability 0.2 and Rock and Paper with

probability 0.4 each.
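The equilibrium probabilities can be checked against the payoff matrix: at equilibrium, no action earns more than any other against the opponent's strategy. The sketch below assumes the usual definition of RPS+, in which any win involving Scissors is worth 2 points instead of 1, and uses hypothetical function names.

```python
# Row player's payoffs; rows and columns are ordered Rock, Paper, Scissors.
RPS = [[0, -1, 1],
       [1, 0, -1],
       [-1, 1, 0]]
RPS_PLUS = [[0, -1, 2],    # assumed: wins and losses involving Scissors count double
            [1, 0, -2],
            [-2, 2, 0]]


def action_values(payoff, opponent):
    """Expected payoff of each pure action against a mixed opponent strategy."""
    return [sum(p * q for p, q in zip(row, opponent)) for row in payoff]


def best_response_value(payoff, opponent):
    """Payoff of the best pure action, i.e. the value of a best response."""
    return max(action_values(payoff, opponent))


# Against the equilibrium strategy, every action is worth the same (zero).
print(action_values(RPS_PLUS, [0.4, 0.4, 0.2]))
# Against a uniform opponent in RPS+, Rock is a profitable best response.
print(best_response_value(RPS_PLUS, [1 / 3, 1 / 3, 1 / 3]))
```

A strategy is exploitable exactly when the best response value against it is above the game value, which is zero here.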

3.2 Experiments with BR in RPS and RPS+

I conducted experiments estimating the BR strategy profile using Regret Matching in the games

of RPS and RPS+. The average strategy for each action is plotted against the number of iterations.

The action profile starts evenly distributed, then quickly converges to a Nash equilibrium. The

gray lines show the exact equilibrium probabilities that the algorithm is converging to.
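A minimal sketch of Regret Matching in self-play is given below. It makes two simplifying assumptions that may differ from the experiments above: regrets are computed from expected payoffs rather than sampled actions, and both players share one strategy, which is valid because the games are symmetric. The RPS+ payoffs assume that any win involving Scissors is worth 2 points, and the function name is hypothetical.

```python
def regret_matching(payoff, iterations):
    """Return the average strategy after self-play with regret matching."""
    n = len(payoff)
    regret_sum = [0.0] * n
    strategy_sum = [0.0] * n
    for _ in range(iterations):
        # Current strategy: proportional to positive cumulative regret.
        positive = [max(r, 0.0) for r in regret_sum]
        total = sum(positive)
        strategy = [p / total for p in positive] if total > 0 else [1.0 / n] * n
        # Expected payoff of each action against an opponent using `strategy`.
        values = [sum(payoff[a][b] * strategy[b] for b in range(n))
                  for a in range(n)]
        expected = sum(strategy[a] * values[a] for a in range(n))
        for a in range(n):
            regret_sum[a] += values[a] - expected
            strategy_sum[a] += strategy[a]
    return [s / iterations for s in strategy_sum]


RPS_PLUS = [[0, -1, 2],
            [1, 0, -2],
            [-2, 2, 0]]
print(regret_matching(RPS_PLUS, 50_000))
```

The strategy on any single iteration can be far from equilibrium; it is the average strategy that approaches 0.4, 0.4 and 0.2 for Rock, Paper and Scissors.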

Figure 3.1: RPS with 1000 iterations

Figure 3.2: RPS with 10k iterations

Figure 3.3: RPS with 100k iterations

Figure 3.4: RPS with 1 million iterations

Figure 3.5: RPS+ with 1000 iterations

Figure 3.6: RPS+ with 10k iterations

Figure 3.7: RPS+ with 100k iterations

Figure 3.8: RPS+ with 1 million iterations

3.3 Experiments with CFR in Coin Toss

Coin Toss is a game which demonstrates that the strategy one player has to take in a subgame

may depend on a subgame’s expected value in a part of the game tree that the player alone

cannot reach. I used CFR to empirically test this hypothesis and plotted the strategy profile

for an increasing number of iterations. Then, I switched the expected value of P1’s

“Sell” subgame (which depends on which side the coin lands) to show that P2’s strategy profile

has to switch probability distributions for its actions as well.
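This dependence can be reproduced without CFR by a direct calculation. The sketch below assumes the payoffs commonly used for Coin Toss: selling is worth 0.5 to the first player when the coin lands Heads and −0.5 when it lands Tails, and if the first player chooses to play, the second player guesses the coin and the first player loses 1 on a correct guess and wins 1 otherwise. The function names and the grid search are illustrative choices.

```python
def p1_value(p_guess_heads, sell_heads=0.5, sell_tails=-0.5):
    """First player's expected payoff when best-responding to the guesser."""
    play_heads = 1 - 2 * p_guess_heads   # coin is Heads: lose 1 if guess is Heads
    play_tails = 2 * p_guess_heads - 1   # coin is Tails: lose 1 if guess is Tails
    # In each coin state the first player picks the better of Sell and Play.
    return (0.5 * max(sell_heads, play_heads)
            + 0.5 * max(sell_tails, play_tails))


def best_guess_probability(sell_heads, sell_tails, steps=100):
    """Grid-search the Heads-guess probability that minimises p1_value."""
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=lambda p: p1_value(p, sell_heads, sell_tails))


print(best_guess_probability(0.5, -0.5))   # prints: 0.25
print(best_guess_probability(-0.5, 0.5))   # prints: 0.75
```

The guessing player never reaches the Sell outcomes, yet swapping their values moves the optimal probability of guessing Heads from 0.25 to 0.75.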

Figure 3.9: CFR with 1000 iterations

Figure 3.10: CFR with 10k iterations

Figure 3.11: CFR with 100k iterations

Figure 3.12: CFR with 1 million iterations

Figure 3.13: CFR with 1 million iterations and reversed outcomes for P1’s “Sell” subgame

3.4 Experiments with CFR, CFR+ and LCFR in Kuhn Poker

The CFR algorithm can be used to find a solution (a strategy profile that forms a Nash

equilibrium) in the game of Kuhn Poker. At each iteration step, the graphs plot the

exploitability of the current solution in mbb/g (milli-big-blinds per game), a measure of how

much a player could lose in expectation using that strategy. Out of the three variations, CFR+

was the quickest to converge to a solution, followed by LCFR and CFR.
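One way to build intuition for this ordering is to compare how each variant accumulates regret for a single action. The sketch below assumes the commonly described update rules: CFR adds the instantaneous regret, CFR+ adds it and floors the total at zero, and LCFR weights the regret of iteration t by t. Other parts of the variants, such as strategy averaging weights and alternating updates, are omitted, and the function names are hypothetical.

```python
def accumulate_regret(cumulative, instantaneous, iteration, variant):
    """Update one cumulative regret value under the chosen variant."""
    if variant == "CFR":
        return cumulative + instantaneous
    if variant == "CFR+":
        return max(cumulative + instantaneous, 0.0)    # never below zero
    if variant == "LCFR":
        return cumulative + iteration * instantaneous  # later iterations weigh more
    raise ValueError(f"unknown variant: {variant}")


def iterations_until_positive(variant, bad_rounds=3):
    """Count good iterations needed after `bad_rounds` bad ones."""
    cumulative, t = 0.0, 0
    for _ in range(bad_rounds):          # the action looks bad: regret -1
        t += 1
        cumulative = accumulate_regret(cumulative, -1.0, t, variant)
    good = 0
    while cumulative <= 0.0:             # the action now looks good: regret +1
        t += 1
        good += 1
        cumulative = accumulate_regret(cumulative, 1.0, t, variant)
    return good


for name in ("CFR", "LCFR", "CFR+"):
    print(name, iterations_until_positive(name))
```

In this toy example an action that looked bad for three iterations is picked up again after 4 good iterations by CFR, 2 by LCFR and 1 by CFR+. This illustrates why the latter two can react faster to early mistakes, although it does not by itself explain the measured results.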

Figure 3.14: CFR

Figure 3.15: LCFR

Figure 3.16: CFR+

3.5 Experiments with CFR, CFR+ and LCFR in Leduc Hold’em

Leduc Hold’em is a larger game than Kuhn Poker: it is played with 6 cards instead of 3

and has an extra round of betting. Exploitability is plotted on a log axis against the number

of iterations of the CFR variations. Once again, CFR+ was the quickest algorithm to learn a

strategy, followed by LCFR.

Figure 3.17: CFR

Figure 3.18: LCFR

Figure 3.19: CFR+

Chapter 4

Conclusions and related work

This project begins with an introduction to game theory, imperfect-information games and

poker. Then, it presents the learning algorithms that can be used to search for solutions in

these games, and explains what a good solution represents. It then describes the importance,

as well as the applicability, of the learning methods presented in real world cases. Then, it

investigates current state-of-the-art methods for learning in No-Limit games, a key development

over the last few years which enables learning in much larger games than previously possible.

The following chapters introduce a number of notable games and present the empirical results

of applying learning methods to search for solutions in them.

The key achievements of this project are the implementations of Regret Matching to ﬁnd a

Best Response strategy in RPS/RPS+ and that of CFR (along with CFR+ and LCFR) to ﬁnd a

Nash equilibrium in the games of Coin Toss, Kuhn Poker and Leduc Hold’em.

Related work includes using various sampling techniques to approximate applying the CFR

algorithm to the entire tree by only visiting a small subset of the branches. Moreover, new

applications of Deep Learning to CFR [15] [26] present a different approach to merging classical

game-theoretic algorithms with non-linear function approximation methods.

The project was divided into two parts. In the first part I read

background material on state-of-the-art methods and built my understanding of

the subject. In the second part I concentrated on replicating as many of the results as

possible. Due to time constraints and the extensive reading required to understand the state of

the art, only some of the learning methods have been replicated. The work can be extended by

using Deep Counterfactual Value Networks to explore solutions in No-Limit betting games.

Moreover, deep learning can be applied directly to CFR in order to intrinsically abstract actions

and increase the scalability of the tree traversal.

Bibliography

[1] J. Bronowski. The Ascent of Man, documentary, episode 13, 1973.

[2] S. Ganzfried and T. Sandholm. Endgame solving in large imperfect-information games.

Carnegie Mellon University. In Proceedings of the 14th International Conference on

Autonomous Agents and Multiagent Systems (AAMAS 2015), 2015.

[3] K. Chaudhuri, Y. Freund, and D. J. Hsu. A parameter-free hedging algorithm.

In Advances in Neural Information Processing Systems, pages 297–305, 2009.

[4] http://mlanctot.info/files/papers/nips09mccfr.pdf.

[5] https://openai.com/blog/openai-five/.

[6] https://cs.stanford.edu/people/eroberts/courses/soco/projects/1998-99/game-theory/neumann.html.

[7] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and

Computation, 108(2):212–261, 1994.

[8] https://en.wikipedia.org/wiki/Game_complexity.

[9] https://news.ycombinator.com/item?id=13558170.

[10] http://poker.cs.ualberta.ca/publications/nips07-cfr.pdf.

[11] N. Brown and T. Sandholm. Libratus: The superhuman AI for no-limit poker (demonstration).

[12] N. Brown and T. Sandholm. Safe and nested subgame solving for imperfect-information

games, 2017.

[13] N. Brown and T. Sandholm. Solving imperfect-information games via discounted regret

minimization, 2018.

[14] N. Brown and T. Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats

top professionals, 2018.

[15] N. Brown, A. Lerer, S. Gross, and T. Sandholm. Deep counterfactual regret minimization,

2018.

[16] N. Brown, T. Sandholm, and B. Amos. Depth-limited solving for imperfect-information

games, 2018.

[17] P. Hopner and E. L. Menca. Analysis and optimization of deep counterfactual value

networks. 2018. doi: 10.1007/978-3-030-00111-7_26.

[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Advances in Neural Information

Processing Systems 25, pp. 1106–1114, 2012.

[19] H. W. Kuhn. Simplified two-person poker. In Harold W. Kuhn and Albert W. Tucker,

editors, Contributions to the Theory of Games, volume 1, pages 97–103. Princeton University

Press, 1950.

[20] D. N. L. Levy and M. Newborn. How Computers Play Chess. Ishi Press, 2009.

[21] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Ried-

miller. Playing Atari with Deep Reinforcement Learning. DeepMind, 2013.

[22] M. Moravcik, M. Schmid, N. Burch, V. Lisy, D. Morrill, N. Bard, T. Davis, K. Waugh,

M. Johanson, and M. Bowling. DeepStack: Expert-level artificial intelligence in no-limit

poker. 2017. doi: 10.1126/science.aam6960.

[23] T. Sandholm. Solving imperfect-information games. Science, 347(6218):122–123, 2015.

ISSN 0036-8075. doi: 10.1126/science.aaa4614. URL

https://science.sciencemag.org/content/347/6218/122.

[24] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche,

J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham,

N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and

D. Hassabis. Mastering the game of Go with deep neural networks and tree search.

Nature, 529(7587):484–489, Jan. 2016. doi: 10.1038/nature16961.

[25] F. Southey, M. Bowling, B. Larson, C. Piccione, N. Burch, D. Billings, and C. Rayner.

Bayes Bluff: Opponent Modelling in Poker. University of Alberta.

[26] E. Steinberger. Single deep counterfactual regret minimization, 2019.

[27] O. Tammelin. Solving large imperfect information games using CFR+, 2014.

[28] O. Tammelin, N. Burch, M. Johanson, and M. Bowling. Solving heads-up limit Texas

hold’em. In Proceedings of the International Joint Conference on Artificial Intelligence

(IJCAI), pages 645–652, 2015.
