University of Manchester
School of Computer Science
Project Report 2019
Learning strategies in large imperfect-information games
Author: Paul Chelarescu
Supervisor: Dr. Jon Shapiro
Abstract
Learning strategies in large imperfect-information games
Author: Paul Chelarescu
The main purpose of this project is to replicate the abilities of the first poker agents that were recently able to beat top human players at the game of No-Limit Texas Hold'em. The results are significant for the artificial intelligence community: poker is the first game in which bluffing is a necessary technique to win, and it was learned remarkably well by the agents. Several techniques were involved, and this project's purpose was to incrementally build upon the core concepts and algorithms employed by the agents and to replicate as much as possible of the most important components in the limited time available. Evaluation is carried out using exploitability, a game-specific measure of how much a strategy can lose against a worst-case opponent. Various visualizations of the performance of the algorithms used to learn strategies are presented.
Supervisor: Dr. Jon Shapiro
Chapter 1
Introduction
The field of Artificial Intelligence has experienced considerable advancements over the recent
decades, and in many cases the progress has been measured by performance that exceeds that
of humans in games such as checkers, chess and Go. The success in these games is based on
the assumption of perfect information, or information symmetry, where all players have the
same information about the current state of the game. Since determining the optimal strategy
only requires knowledge of the current state, this fact has been leveraged by nearly every AI
developed for perfect-information games, including AIs that defeated top humans in chess and
Go [28]. This property had a direct implication for the methods being used to find solutions
in these games, such as minimax search with alpha-beta pruning [20] and Monte Carlo Tree
Search with policy evaluation [24]. Up to 2015, every non-trivial game that had been solved was a perfect-information game [28].
Despite these successes, and unlike parlor games, decision making in the real world can
rarely assume settings with perfect information. Much more realistic interactions between
multiple agents include the lack of complete knowledge of past events and bluffing with in-
formation held only privately. Dealing with hidden information requires drastically different
AI techniques. In 2015, it was announced that Heads-Up Limit Texas Hold'em, the smallest variant of poker routinely played by humans, had been weakly solved [28]. Then, in 2017, for the first time, two poker AIs called DeepStack [22] and Libratus [14] [11], which had been developed independently and concurrently, defeated professional human poker players with statistical significance at the game of Heads-Up No-Limit Texas Hold'em, the most popular form of two-player poker and a game several orders of magnitude larger than its limit variant.
The goal of this project is to understand and implement as many of the methods used in these poker AIs as possible, but on simplified games. I have implemented the Counterfactual Regret Minimization (CFR) algorithm, along with its variations CFR+ [27] and LCFR [13], applied them to search for solutions in the games of Rock-Paper-Scissors, Coin Toss, Kuhn Poker and Leduc Hold'em, and compared their performance.
1.1 Background
Imperfect information games model sequential interactions between multiple agents where pri-
vate information is involved. Poker is the quintessential game of imperfect information, the
first of this kind used as an example in the game theory field [6], and arguably the most popular
card game in the world with millions of players worldwide. In poker, players have knowledge
about the current state of the game that is not shared with their opponents. A player therefore has to act under uncertainty, but can also take advantage of their opponent's uncertainty. In poker, each player is dealt private cards (called a hand) and wagers on having the strongest hand, possibly by bluffing, calling opponent bets or folding to give up their hand. Poker captures the challenge of inferring the decisions of the opponent without revealing our own intentions. Compared to perfect-information games, where all information needed to infer a strategy is always visible on the board, poker requires more complex reasoning. For example, the first player to act does not know which cards the other player has been dealt and therefore has to behave the same regardless of the opponent's private information. As the game progresses, more of each player's private information is revealed through their actions and reflected in each other's beliefs. The optimal decision in a particular state therefore depends on the probability distribution over private information that each player holds about the other. This amounts to recursive reasoning, and algorithms which tackle the game of poker have to handle this information asymmetry. It is the key reason why algorithms that are competitive in games of perfect information adapt poorly to games of imperfect information.
For John von Neumann, one of the founders of modern game theory, poker was a compelling representation of how he viewed games in real life: more than a well-defined form of computation, a contest of bluffing and deception without precise solutions [1]. One particular challenge for him was that probability theory alone is not enough to play poker competitively, which inspired his desire to formalize the idea of "bluffing" [6]. The idea is captured in the following way: to construct an effective strategy in poker, the amount currently in the pot, as well as the beliefs of both players, need to be taken into consideration at every decision point.
1.2 The Game
Heads-up no-limit Texas Hold’em (HUNL) is a two-player version of the game of poker, which
has become a primary benchmark for imperfect-information games [13]. In HUNL each player
is dealt two face-down cards, called a hand, from a standard 52-card deck. The first round is called the pre-flop, and each player starts with a stack containing a fixed number of chips.
The game proceeds with a few rounds of betting as additional cards are dealt face-up in three
subsequent rounds: the flop, turn and river. Both players are wagering on having the strongest
hand at the end of the game. The strength of a hand depends on matching the cards on the table
with the ones held privately. There is no limit placed on the size of the bets, which radically
increases the number of possible actions in each decision point. Instead of being restricted to
a few betting amounts, players are allowed to bet as much as they want and can even go all-in, wagering every remaining chip. At the end of the betting rounds, unless one player folds and forfeits the pot to their opponent, there is a showdown and the strongest hand wins the pot. The players are then dealt new hands, and the game is repeated. The goal of each player is to win as many chips as possible from the other player over a certain number of hands played.
The games being investigated in this project are simplified versions of HUNL that retain the
same level of mechanical complexity, with a key difference being a reduction in the number
of rounds and the size of the deck of cards being used. The standard performance measure is milli-big-blinds per game, or mbb/g, a unit agnostic to the monetary value of the bets being wagered.
AI techniques have shown superhuman performance in games with very large search spaces such as Go (with $10^{170}$ game states), Shogi ($10^{70}$) and chess ($10^{47}$) [8], in Atari video games (learning directly from high-dimensional sensory inputs) [21] and in real-time strategy games [5]. These achievements require capabilities such as scalability, robustness to exploitation, versatility and learning without human supervision. Besides being a testbed where these capabilities can be challenged, poker is a problem that is very well grounded in theoretical analysis [6]. Furthermore, poker is a massive game in its own right: the most popular variant, No-Limit Heads-Up Texas Hold'em, has a game tree on the order of $10^{160}$ states, with the added complexity of hidden information and stochastic outcomes [22].
1.3 Implications
The main representation used within computations for poker is the extensive-form sequential
game, a powerful mathematical structure which can also capture interactions such as security,
negotiations, business, medical treatments, strategic pricing, military applications and many
more. This has implications for the transferability of the methods employed in solving poker. Because these methods are general and require little domain knowledge, they can be applied to any other field, as long as there exists an extensive-form representation that faithfully translates the
interactions involved. Once this is achieved, an equilibrium finding algorithm can handle the
search for an optimal solution in the game.
1.4 Solving Poker Games
Before 2006, general-purpose linear programming was used to solve small variants of poker
[23], but since then two families of dramatically more scalable equilibrium-finding algorithms
and problem representations have been developed for two-player zero-sum games. These algo-
rithms are gradient descent with a decomposed problem representation, and counterfactual regret minimization (CFR), the latter of which can be compared to a form of self-play similar to reinforcement learning with an advantage function [9].
an extensive game, with abstraction commonly employed to reduce the size of the game tree
[10]. Abstraction was used both explicitly in Libratus and implicitly in DeepStack in order to
achieve superhuman performance.
Rational play in such a game is defined by a Nash equilibrium: a profile of strategies, one for each player, which indicates, for each information set where it is that player's turn, the probability with which to take each of the available actions [23]. In a Nash equilibrium, no player can
improve by shifting to a different strategy. Imperfect information is represented by information
sets, which are collections of game states indistinguishable from each other for the player
whose turn it is. Thus, players have to act on all the nodes in an information set at each decision
point [12]. Exploitability measures how much a strategy loses against its worst-case opponent. In heads-up games, a strategy with zero exploitability cannot be beaten in expectation and is regarded as perfect play. Finding a solution in this context means searching for a strategy which exhibits rational play and has low exploitability.
Extensive-form games provide a general and compact model of sequential multiagent interaction [10]. The intuition behind them is to represent decisions inside the game as paths along the branches of a tree, where nodes are game states and edges are the valid actions a player can take at that point. Each non-terminal node has an associated player taking actions, while each terminal node has a utility function assigning payoffs. Information sets are collections of game states which a particular player cannot distinguish between, because of the private information of the other players. A description of the formal model used for computations in imperfect-information games is presented in the following paragraphs.
1.5 Extensive form games
Two-player zero-sum games have a set of two players $P = \{1, 2\}$. A finite set $H$ represents all possible histories in the game as sequences of actions, which can be thought of as nodes in a game tree. $Z \subseteq H$ is the set of terminal histories. For each element $h \in H$, the action function $A$ defines the set of available actions at that node as $A(h)$. When an action $a$ leads from a node $h$ to $h'$, this is written as $h \cdot a = h'$. $P(h) \in P \cup \{c\}$ defines the player who acts at that node, with $c$ representing chance. When referring to a player $p \in P$, their opponent is denoted $-p$. Every terminal node $z \in Z$ has an associated utility function $u(z)$ which assigns a payoff to each player. Zero-sum games have the property that $u_p(z) = -u_{-p}(z)$ for all $z \in Z$.

Imperfect information is represented by information sets, a partition of $H$ for each player $p \in P$ such that all nodes $h$ and $h'$ from the same information set $I_p$ are indistinguishable to $p$. Histories in an information set $I_p$ therefore differ only in the private information of player $-p$.
Each acting player $p$ has an associated strategy $\sigma_p$, which gives a probability vector over the available actions at each information set $I$. The probability of choosing a specific action $a$ at $I$ is given by $\sigma(I,a)$. A strategy profile $\sigma$ is a tuple $(\sigma_p, \sigma_{-p})$ and the associated utility $u_p(\sigma_p, \sigma_{-p})$ represents the expected payoff for $p$ if both players follow that profile. Let the probability of reaching history $h$ if both players follow $\sigma$ be denoted by $\pi^{\sigma}(h)$, and let $\pi^{\sigma}_p(h)$ be the contribution of $p$ to this probability (the probability of reaching $h$ if chance and $-p$ play deterministically while $p$ plays according to $\sigma_p$). Similarly, define $\pi^{\sigma}(h, h')$ as the probability of reaching $h'$ given that $h$ has already been reached. From this it follows that, if both players act according to $\sigma$, the probability of reaching $I$ is $\pi^{\sigma}(I) = \sum_{h \in I} \pi^{\sigma}(h)$, and that, if only $p$ chooses actions leading towards $I$ while chance and $-p$ play according to $\sigma_{-p}$, then $\pi^{\sigma}_{-p}(I) = \sum_{h \in I} \pi^{\sigma}_{-p}(h)$.
The expected utility of player $p$ given history $h$, assuming both players follow $\sigma$, is denoted by $u^{\sigma}_p(h) = \sum_{z \in Z} \pi^{\sigma}(h,z)\, u_p(z)$. A best response to $\sigma_{-p}$ is a strategy $BR(\sigma_{-p})$ such that $u_p(BR(\sigma_{-p}), \sigma_{-p}) = \max_{\sigma'_p} u_p(\sigma'_p, \sigma_{-p})$. A Nash equilibrium is a strategy profile $\sigma^*$ where no single player can improve their utility by changing strategies, and therefore every player plays a best response. Formally, $\sigma^*$ satisfies $\forall p,\ u_p(\sigma^*_p, \sigma^*_{-p}) = \max_{\sigma'_p \in \Sigma_p} u_p(\sigma'_p, \sigma^*_{-p})$. A strategy profile is part of an $\varepsilon$-Nash equilibrium if the best counter-strategy can beat it by at most $\varepsilon$, formally written as $\forall p,\ u_p(\sigma_p, \sigma_{-p}) + \varepsilon \geq \max_{\sigma'_p \in \Sigma_p} u_p(\sigma'_p, \sigma_{-p})$.
In a two-player zero-sum game, the exploitability of a strategy is measured by how much worse it performs against a best response compared to a Nash equilibrium strategy. Formally, the exploitability of strategy $\sigma_p$ is written as $u_p(\sigma^*) - u_p(\sigma_p, BR(\sigma_p))$.
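To make the definitions above concrete, the following sketch computes best-response values and exploitability for a small two-player zero-sum normal-form game, using Rock-Paper-Scissors (introduced formally in Chapter 2) as the example. It is a minimal illustration under my own naming conventions rather than code from the project or from any library; for sequential games the best response is computed over the game tree instead of a payoff matrix.

import numpy as np

# Payoff matrix for the row player in Rock-Paper-Scissors; the column
# player's payoff is the negation, since the game is zero-sum.
RPS_PAYOFF = np.array([[ 0, -1,  1],
                       [ 1,  0, -1],
                       [-1,  1,  0]], dtype=float)

def best_response_value(payoff, strategy, player):
    # Value a best-responding opponent achieves against `strategy`.
    # `player` is 0 if `strategy` belongs to the row player, 1 otherwise.
    if player == 0:
        expected = strategy @ payoff      # row player's payoff per column action
        return -expected.min()            # opponent minimises the row player's payoff
    expected = payoff @ strategy          # row player's payoff per row action
    return expected.max()

def exploitability(payoff, row_strategy, col_strategy):
    # Average best-response gain against the profile; zero iff it is a Nash equilibrium.
    return (best_response_value(payoff, row_strategy, 0)
            + best_response_value(payoff, col_strategy, 1)) / 2

uniform = np.ones(3) / 3
print(exploitability(RPS_PAYOFF, uniform, uniform))                # 0.0, the equilibrium
print(exploitability(RPS_PAYOFF, np.array([1.0, 0, 0]), uniform))  # 0.5, pure Rock is exploitable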
1.6 Counterfactual Regret Minimization
Regret is an online learning concept that indicates how much better a player would have done had it always played a specific action at a specific decision point, instead of playing according to its strategy. Counterfactual regret minimization (CFR) is an iterative algorithm which, on every iteration $t \in \{1, \dots, T\}$, computes a strategy $\sigma^t_p$ for all players $p \in P$, and whose average strategy $\bar{\sigma}^T$ converges to an approximate Nash equilibrium as $T \to \infty$ in two-player zero-sum games [10]. The counterfactual value for player $p$ at a node $h$, when players adopt the strategy $\sigma$, is
$$v^{\sigma}_p(h) = \sum_{z \in Z} \pi^{\sigma}(h,z)\, u_p(z)$$
The counterfactual value of an information set is the weighted average of the values of the nodes inside it, where a node is given weight based on the belief of the player being in that node. Formally, the counterfactual value of an information set is given by
$$v^{\sigma}_p(I_p) = \frac{\sum_{h \in I_p} \pi^{\sigma}_{-p}(h)\, v^{\sigma}_p(h)}{\sum_{h \in I_p} \pi^{\sigma}_{-p}(h)}$$
while the action-dependent counterfactual value has the form
$$v^{\sigma}_p(I_p, a) = \frac{\sum_{h \in I_p} \pi^{\sigma}_{-p}(h)\, v^{\sigma}_p(h \cdot a)}{\sum_{h \in I_p} \pi^{\sigma}_{-p}(h)}$$
The instantaneous regret $r^t(I,a)$ of player $p$ for action $a \in A(I)$ on iteration $t$ is how much better off $p$ would have been had it always chosen action $a$ at $I$, having played to get to $I$ but according to $\sigma^t$ thereafter. It is formally defined as
$$r^t(I,a) = \pi^{\sigma^t}_{-p}(I)\left(v^{\sigma^t}(I,a) - v^{\sigma^t}(I)\right)$$
The overall regret on iteration $T$ represents the regret accumulated so far and is defined as
$$R^T_p(I,a) = \sum_{t=1}^{T} r^t_p(I,a)$$
Regret matching is an algorithm for choosing actions in the future in proportion to positive regret, or how much better the outcome could have been. CFR determines the strategy in one iteration by applying regret matching to each information set [3] [7]. Formally, on each iteration $t+1$, player $p$ chooses action $a$ based on
$$\sigma^{t+1}_p(I,a) = \begin{cases} \dfrac{R^{t}_p(I,a)^{+}}{\sum_{\tilde{a} \in A(I)} R^{t}_p(I,\tilde{a})^{+}} & \text{if } \sum_{\tilde{a} \in A(I)} R^{t}_p(I,\tilde{a})^{+} > 0 \\[2ex] \dfrac{1}{|A(I)|} & \text{otherwise} \end{cases}$$
where $x^{+} = \max(x, 0)$. The average strategy for an information set $I$ on iteration $T$ is defined as
$$\bar{\sigma}^T_p(I) = \frac{\sum_{t=1}^{T} \pi^{\sigma^t}_p(I)\, \sigma^t_p(I)}{\sum_{t=1}^{T} \pi^{\sigma^t}_p(I)}$$
and has the property of forming an $\varepsilon$-Nash equilibrium as $T \to \infty$. In two-player zero-sum games, if both players have average regret $\frac{R^T_p}{T} \leq \varepsilon$, then their average strategies $(\bar{\sigma}^T_p, \bar{\sigma}^T_{-p})$ form a $2\varepsilon$-Nash equilibrium. Thus, CFR constitutes an anytime algorithm for finding an $\varepsilon$-Nash equilibrium in two-player zero-sum games.
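To illustrate how the regret-matching and averaging equations translate into code, the sketch below shows the per-information-set bookkeeping used by a tabular CFR implementation. The class and method names are my own; a complete implementation additionally has to traverse the game tree to compute the counterfactual values that produce the regrets (a fuller sketch for Kuhn Poker appears later in this report).

import numpy as np

def regret_matching(cumulative_regret):
    # Normalise the positive part of the cumulative regrets into a strategy;
    # if no action has positive regret, play uniformly at random.
    positive = np.maximum(cumulative_regret, 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total
    return np.full(len(cumulative_regret), 1.0 / len(cumulative_regret))

class InfoSetNode:
    # Storage attached to a single information set I.
    def __init__(self, num_actions):
        self.cumulative_regret = np.zeros(num_actions)   # R^T(I, a)
        self.strategy_sum = np.zeros(num_actions)        # numerator of the average strategy

    def current_strategy(self, own_reach_probability):
        # sigma^t(I), also accumulated into the average weighted by pi^sigma_p(I).
        strategy = regret_matching(self.cumulative_regret)
        self.strategy_sum += own_reach_probability * strategy
        return strategy

    def average_strategy(self):
        # The quantity that converges to a Nash equilibrium strategy.
        total = self.strategy_sum.sum()
        if total > 0:
            return self.strategy_sum / total
        return np.full(len(self.strategy_sum), 1.0 / len(self.strategy_sum))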
1.7 Advancements of CFR
There are many improvements over the vanilla CFR algorithm that have been shown to achieve much faster performance. CFR+ is a variation which changes the overall regret and average strategy equations in order to converge an order of magnitude faster [27]. First, instead of simultaneous updates for both players, CFR+ updates the two players alternately and weights iteration $t$'s contribution to the average strategy $\bar{\sigma}^T_p$ by $t^2$. Secondly, after each iteration, all actions with negative accumulated regret are reset to zero. Formally, CFR+ replaces the overall regret $R^t_p(I,a)^{+}$ with
$$Q^T(I,a) = \max\left(0,\; Q^{T-1}(I,a) + r^T(I,a)\right)$$
Despite its advantages, CFR+ is sensitive to variance and is therefore difficult to use with sampling and function approximation [4]. Linear CFR (LCFR) is a form of discounted regret minimization which is identical to vanilla CFR, except that on iteration $t$ the updates to the instantaneous regrets $r^t$ and average strategies $\sigma^t$ are given weight $t$. Formally, in LCFR the regret is defined as
$$R^T(I,a) = \sum_{t=1}^{T} t\, r^t(I,a)$$
while the average strategy is defined as
$$\bar{\sigma}^T_p(I) = \frac{\sum_{t=1}^{T} t\, \pi^{\sigma^t}_p(I)\, \sigma^t_p(I)}{\sum_{t=1}^{T} t\, \pi^{\sigma^t}_p(I)}$$
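Seen side by side, the three variants differ only in how they accumulate regrets and average-strategy contributions. The sketch below contrasts the update rules, reusing the cumulative_regret and strategy_sum fields of the hypothetical InfoSetNode from the earlier sketch; it follows the weightings described in the text and omits the alternating player updates that CFR+ also performs.

import numpy as np

def update_vanilla_cfr(node, inst_regret, own_reach, strategy, t):
    # CFR: uniform weight for both the regrets and the average strategy.
    node.cumulative_regret += inst_regret
    node.strategy_sum += own_reach * strategy

def update_cfr_plus(node, inst_regret, own_reach, strategy, t):
    # CFR+: regrets are floored at zero after every update (regret matching+),
    # and iteration t's contribution to the average strategy is weighted by t^2.
    node.cumulative_regret = np.maximum(node.cumulative_regret + inst_regret, 0.0)
    node.strategy_sum += (t ** 2) * own_reach * strategy

def update_lcfr(node, inst_regret, own_reach, strategy, t):
    # Linear CFR: both the regrets and the average strategy get weight t.
    node.cumulative_regret += t * inst_regret
    node.strategy_sum += t * own_reach * strategy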
CFR variants have focused on improvements that allow them to find solutions in ever bigger games. CFR+ was used to near-optimally solve Heads-up Limit Texas Hold'em, a version of poker with $10^{13}$ unique decision points [28]. DeepStack was the first agent able to beat top human poker players at the game of Heads-up No-Limit Texas Hold'em, the full version with $10^{160}$ decision points, by utilizing a hybrid of vanilla CFR and CFR+. In DeepStack, the regret minimizing algorithm uses uniform weighting and simultaneous updates like CFR, but computes the final average strategy and average counterfactual values by omitting early iterations [22]. CFR+ was also used to approximately solve HUNL endgames in Libratus, a poker agent which defeated top HUNL professionals [14] [12]. LCFR is faster than CFR+ in certain settings with wide distributions of payoffs and tolerates errors far better [13], leading to drastically faster convergence without increasing the computation cost per iteration.
1.8 Superhuman performance in Poker Games
The most notable advancement in the recent past is that algorithms which are able to compete
with humans at HUNL employ strategies computed online during play instead of being re-
trieved from an offline pre-computation. Such methods, which are called real-time solving for
Libratus and re-solving for DeepStack, have been developed independently and concurrently.
The similarity between them is that they both solve nested subgames of poker as they appear
during play. DeepStack solves a depth-limited subgame during the betting rounds by estimat-
ing the opponent counterfactual values at the depth limit with a deep neural network. Libratus
plays according to a detailed pre-computed strategy at the beginning of the game and rounds
off unexpected opponent actions to a nearby abstraction. When the game is nearing the end, it
then creates a much more detailed abstraction of the subgame in real-time and solves it taking
into consideration the broader context of the original abstraction.
The common challenge for both DeepStack and Libratus is incorporating the online com-
putations of the current subgame within the broader game. Solving subgames in imperfect-
information games cannot be done in isolation because their optimal strategy may depend on
other, unreached subgames related to the current game state. Another challenge that both agents had to overcome is considering how the opponent can adapt to changes in the strategies. For example, always betting with good hands is a bad strategy, since the opponent can learn to fold to our bets. Another example is that when abstractions are incorporated, players can exploit them by crafting strategies which play the actions that are most overlooked by the
abstraction. Libratus incorporates a self-improvement module which enhances its abstraction
as players search for holes in the strategy by computing game-theoretic strategies for miss-
ing branches in the abstraction. Finally, states in imperfect-information games do not have a
single well-defined value, but depend on aspects such as how players have reached the state.
DeepStack learns a counterfactual value function with deep neural networks that approximates
outcomes in a state using players’ ranges, which are probability distributions over possible
hands given that the public state is reached. Furthermore, rather than a single value, the deep counterfactual value network outputs a vector of counterfactual values $v^{\sigma}_j$, where $j$ ranges over each possible hand combination.
Chapter 2
Development
This project investigates the empirical performance of various CFR and Regret Matching algo-
rithms on simplified imperfect-information games.
2.1 Rock Paper Scissors
Rock Paper Scissors (RPS) is a two-player zero-sum game where each player simultaneously makes one of three choices: rock (R), paper (P) and scissors (S). Each choice gives the opportunity to win, lose or draw against the opponent. Action R beats S, S beats P, and P in turn beats R. Every choice has a counter-choice, making the game a paradigm for studying competition caused by cyclical dominance. The goal of the game is to beat the opponent's strategy.
Formally, cyclical dominance can be expressed quantitatively with a payoff matrix. The game has a set of two players $P = \{P_1, P_2\}$, a finite set of actions $S_p$ for each player $p \in P$, a set of possible action combinations $A = S_1 \times \dots \times S_n$ and a function $u$ which maps each action combination to a vector of utilities, one for each player. Winning has a utility of 1, losing has a utility of -1 and a draw has a utility of 0. RPS can be represented as a normal-form game, an $n$-dimensional table where each row and column corresponds to a single player's actions, and each entry in the table is a vector of payoffs, one for each player. The payoff table for RPS is as follows:

      R        P        S
R    0, 0    -1, 1     1, -1
P    1, -1    0, 0    -1, 1
S   -1, 1     1, -1    0, 0

Table 2.1: Rock-Paper-Scissors payoffs (row player listed first)
Even though the game is played simultaneously, an equally valid alternative representation is to consider that one player acts first, and the second player then has incomplete information about the state of the game and has to make a choice without knowing the private decision of the first player.
In an iterated game of RPS, if one player always chooses some action, their opponent can
exploit them and learn to always choose their counter-action. If one player always chooses
between two actions, their opponent can take advantage of such a regularity. To avoid exploita-
tion, and to reach a Nash equilibrium, players have to adopt a strategy which distributes their actions evenly across the three options.

Figure 2.1: Rock-Paper-Scissors shown as an extensive form game

This learning process is called regret matching, and it is one of the key components of learning in imperfect-information games. The main idea behind it is that an agent learning to play RPS well selects actions at random with a distribution proportional to positive regret, i.e. how much the agent wishes it had taken that action in the past. Ideally, the agent will minimize its expected regret over time, learn a best response strategy and converge to a correlated equilibrium.
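As a concrete illustration, the short sketch below pits two regret-matching agents against each other in RPS; the function names and the use of sampled (rather than expected) payoffs are my own simplifications for the example. Over many iterations the average strategies drift towards the uniform (1/3, 1/3, 1/3) equilibrium.

import numpy as np

PAYOFF = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]], dtype=float)   # row player's payoff

def regret_matching(regret):
    positive = np.maximum(regret, 0.0)
    return positive / positive.sum() if positive.sum() > 0 else np.full(3, 1 / 3)

def self_play(iterations=100_000):
    regrets = [np.zeros(3), np.zeros(3)]
    strategy_sums = [np.zeros(3), np.zeros(3)]
    rng = np.random.default_rng(0)
    for _ in range(iterations):
        strategies = [regret_matching(r) for r in regrets]
        a0, a1 = (rng.choice(3, p=s) for s in strategies)
        # Regret of each action = its payoff against the opponent's actual
        # action, minus the payoff that was actually obtained.
        regrets[0] += PAYOFF[:, a1] - PAYOFF[a0, a1]
        regrets[1] += -PAYOFF[a0, :] - (-PAYOFF[a0, a1])
        for p, s in enumerate(strategies):
            strategy_sums[p] += s
    return [s / s.sum() for s in strategy_sums]

print(self_play())    # both average strategies approach (1/3, 1/3, 1/3)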
Rock-Paper-Scissors+ (RPS+) [16] is a variation of RPS in which, instead of uniform outcomes, any win or loss involving Scissors is worth 2 instead of 1: the winner's payoff becomes 2 and their opponent's payoff becomes -2. Contrary to intuition, this incentivizes an increased probability of choosing Rock and Paper in order to maintain the Nash equilibrium: although the raised stakes might seem to make Scissors more attractive, they make over-playing it more exploitable. To remain unexploitable, a player wishing to play the Nash equilibrium has to adopt a strategy of playing Rock, Paper and Scissors with probabilities (0.4, 0.4, 0.2). RPS+ was defined by Brown et al. in 2018 in order to exemplify the challenge of depth-limited solving in imperfect-information games.
      R        P        S
R    0, 0    -1, 1     2, -2
P    1, -1    0, 0    -2, 2
S   -2, 2     2, -2    0, 0

Table 2.2: Rock-Paper-Scissors+, a variation where the stakes are doubled whenever Scissors is played
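The (0.4, 0.4, 0.2) equilibrium follows directly from the indifference conditions implied by Table 2.2. If the opponent plays Rock, Paper and Scissors with probabilities $(r, p, s)$, the expected payoffs of the three actions are
$$\mathbb{E}[R] = -p + 2s, \qquad \mathbb{E}[P] = r - 2s, \qquad \mathbb{E}[S] = -2r + 2p.$$
Setting the three expectations equal and using $r + p + s = 1$ gives $s = 0.2$ and $r = p = 0.4$, at which point every action has an expected value of 0.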
The next chapter includes experimental results showing the convergence rate, measured by exploitability, of the CFR algorithm and its variations CFR+ and LCFR, as well as results on the games of RPS and RPS+.
2.2 Coin Toss
Coin Toss is a simple two-player game that was first introduced in 2017 by Brown et al. as an example of why imperfect-information subgames cannot be solved in isolation [12]. Subgame solving is a key component of state-of-the-art poker AIs, and this project shows how CFR can be used to learn a strategy in the entire game of Coin Toss. Furthermore, I present the relationship between the strategy profile learned with CFR and the use of subgame solving.
Figure 2.2: Coin Toss in extensive form. The dashed line represents $P_2$'s information set, meaning $P_2$ cannot distinguish between the two states. The payoffs are shown only for $P_1$; $P_2$ receives the negation of $P_1$'s payoff.
In Coin Toss, there are two players, $P_1$ and $P_2$, and a coin which has equal probability of landing Heads or Tails. The coin is flipped and only $P_1$ sees the outcome. Then, $P_1$ must choose between two actions called "Sell" and "Play". The "Sell" action leads to a subgame whose expected value (EV) depends on whether the coin landed Heads or Tails. If the coin lands Heads, $P_1$ receives an EV of 0.5. On the other hand, if the coin lands Tails, $P_1$ receives an EV of -0.5. Should $P_1$ choose the action "Play", then $P_2$ may guess how the coin landed. If $P_2$ guesses correctly, $P_1$ receives an EV of -1; however, if $P_2$ guesses incorrectly, $P_1$ receives an EV of 1. $P_2$ can also choose to forfeit, in which case $P_1$ also receives an EV of 1. The optimal strategy for $P_2$ is to choose Heads with 25% probability and Tails with 75% probability in the "Play" subgame, which gives both players an EV of 0. $P_1$ can always choose "Sell" to receive the same EV, or it could choose any mixed combination of "Play" and "Sell" strategies and would still receive at best an EV of 0. However, should the roles of the sides of the coin reverse, so will the choices in $P_2$'s strategy. If $P_1$ starts to receive an EV of -0.5 when it sells Heads, and 0.5 when it sells Tails, $P_2$ will have to adapt its strategy by guessing Heads with 75% probability and Tails with 25% probability. This shows that $P_2$'s strategy depends on $P_1$'s outcome in an unrelated subgame that $P_2$ cannot even reach, but which nevertheless influences $P_1$'s own strategy and in turn $P_2$'s decisions. Thus, it is impossible to solve a subgame without using information from other parts of the game. This is the central challenge of imperfect-information games as opposed to perfect-information games.
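A short calculation makes this dependence explicit. Let $q$ be the probability with which $P_2$ guesses Heads after "Play". Then
$$\mathrm{EV}_{P_1}(\text{Play} \mid \text{Heads}) = q(-1) + (1-q)(+1) = 1 - 2q, \qquad \mathrm{EV}_{P_1}(\text{Play} \mid \text{Tails}) = q(+1) + (1-q)(-1) = 2q - 1.$$
At $q = 0.25$ these equal the "Sell" values of 0.5 and -0.5 respectively, so $P_1$ is indifferent between "Play" and "Sell" for either coin outcome and the overall expected value is $\tfrac{1}{2}(0.5) + \tfrac{1}{2}(-0.5) = 0$. When the "Sell" values are reversed, the indifference point moves to $q = 0.75$.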
The CFR algorithm, along with the variants explored in this project, visits the entire game tree at every iteration, which sidesteps this limitation. Larger games, such as HUNL, are too big for CFR to visit all their states even once. The classical solution, which is to search for a solution in an abstraction of the entire game, lends itself to high exploitability. Subgame solving can instead be applied to fit a refined solution for a subgame within the overall strategy created for an abstraction of the much larger game. Such methods were used in both DeepStack and Libratus to win against top human players.
2.3 Kuhn Poker
Kuhn Poker is another simple zero-sum game, played with a deck of 3 cards: the Jack, the Queen and the King. The game was created by Harold W. Kuhn in 1950 [19]. The two players, $P_1$ and $P_2$, each bet 1 chip, called the ante, into the pot before they are both dealt one card from the deck. Their card is held as private information. The players alternate in being
the first to take an action.

Figure 2.3: Partial game tree for Kuhn Poker. Payoffs are shown only for $P_1$; $P_2$ receives the opposite of $P_1$'s payoff. The deck of cards is shown as a player which deals a card at random to $P_1$, then to $P_2$.

On a turn, a player has the option to either pass or bet. When a
player bets, they put an additional chip into the pot. If a player passes after their opponent bets,
they forfeit their ante and the pot gets awarded to their opponent. However, if there are two
successive passes or two successive bets, there is a showdown and the player with the strongest
hand (higher card) wins and takes all the chips in the pot.
There are six possible ways in which the cards can be dealt to the players, and 12 resulting information sets. The expected value for $P_1$ in the Nash equilibrium of this game is $-1/18$ chips per hand, which means the first player loses on average.
Many of the challenges of HUNL, such as bluffing with a weak hand, passing with a good hand and avoiding being exploited by the opponent, are all captured in Kuhn Poker. CFR is an algorithm which can be used to search for a solution in Kuhn Poker, where a solution means a strategy profile which is part of an $\varepsilon$-Nash equilibrium. This project explores the performance differences between CFR, CFR+ and LCFR in learning an optimal strategy in Kuhn Poker.
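Kuhn Poker is small enough that a complete vanilla CFR implementation fits in a short script. The sketch below is a minimal chance-sampling version written for illustration, in the style of the classic tabular formulation: the deal is sampled each iteration instead of enumerating all six deals, and the class and function names are my own. It is not the exact code evaluated in Chapter 3, but running it for enough iterations drives the average game value towards -1/18.

import numpy as np

ACTIONS = "pb"                    # p = pass, b = bet

class Node:
    # Cumulative regrets and average-strategy numerator for one information set.
    def __init__(self):
        self.regret_sum = np.zeros(2)
        self.strategy_sum = np.zeros(2)

    def strategy(self, own_reach):
        positive = np.maximum(self.regret_sum, 0.0)
        s = positive / positive.sum() if positive.sum() > 0 else np.full(2, 0.5)
        self.strategy_sum += own_reach * s
        return s

    def average_strategy(self):
        total = self.strategy_sum.sum()
        return self.strategy_sum / total if total > 0 else np.full(2, 0.5)

nodes = {}

def cfr(cards, history, p0, p1):
    # Returns the expected utility for the player to act, given reach probabilities.
    player = len(history) % 2
    opponent = 1 - player
    # Terminal histories: a fold after a bet, or a showdown after "pp" or "bb".
    if len(history) > 1:
        player_higher = cards[player] > cards[opponent]
        if history[-1] == "p":
            if history == "pp":                    # showdown for the antes only
                return 1.0 if player_higher else -1.0
            return 1.0                             # the opponent folded to a bet
        if history[-2:] == "bb":                   # showdown with one extra chip each
            return 2.0 if player_higher else -2.0

    infoset = str(cards[player]) + history
    node = nodes.setdefault(infoset, Node())
    strategy = node.strategy(p0 if player == 0 else p1)

    action_utils = np.zeros(2)
    for a, action in enumerate(ACTIONS):
        if player == 0:
            action_utils[a] = -cfr(cards, history + action, p0 * strategy[a], p1)
        else:
            action_utils[a] = -cfr(cards, history + action, p0, p1 * strategy[a])
    node_util = float(strategy @ action_utils)

    # Counterfactual regrets are weighted by the opponent's reach probability.
    opponent_reach = p1 if player == 0 else p0
    node.regret_sum += opponent_reach * (action_utils - node_util)
    return node_util

def train(iterations=100_000):
    rng = np.random.default_rng(0)
    total = 0.0
    for _ in range(iterations):
        cards = rng.permutation(3)                 # 0 = Jack, 1 = Queen, 2 = King
        total += cfr(cards, "", 1.0, 1.0)
    return total / iterations                      # approaches -1/18 for the first player

print("average game value:", train())
for infoset, node in sorted(nodes.items()):
    print(infoset, node.average_strategy().round(3))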
2.4 Leduc Hold’em
Leduc Hold'em is a variant of Texas Hold'em which is sometimes used in academic contexts, first introduced in [25]. Leduc Hold'em seeks to retain the strategic elements of Texas Hold'em while keeping the game small enough that computing exact solutions remains tractable.
The game is played with a deck of six cards, comprising two suits of three ranks each: the
Jack, the Queen and the King. There are two rounds of betting in the game. Each player starts
with a stack containing a fixed number of chips and has to put one chip into the pot as an ante. Then, each player is dealt a single card, kept private, as their hand. The first round of betting takes place, with players having the option of calling to match their opponent's bet, raising the bet amount or folding their hand. The raise amount is fixed at 2 chips in the first round and 4 chips in the second. Players must call their opponent's raise if they want to proceed to the next round. Then, a board card is revealed,
and there is another round of betting. Finally, if no player folds, there is a showdown, and the
strongest hand wins the pot. The strongest hand is the one which has the same rank as the card
on the table. If no player makes a pair with the board card, the strongest hand is the one with
the higher rank.
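The showdown rule translates into a few lines of code. The helper below is purely illustrative, with my own encoding of ranks as integers 0-2 (Jack, Queen, King); suits are irrelevant at showdown because only a pair with the board card or the higher rank matters.

def leduc_showdown(card_p1, card_p2, board_card):
    # Returns 1 if P1 wins, -1 if P2 wins, 0 for a split pot.
    p1_pair = card_p1 == board_card
    p2_pair = card_p2 == board_card
    if p1_pair != p2_pair:              # exactly one player paired the board card
        return 1 if p1_pair else -1
    if card_p1 == card_p2:              # same rank (different suits): split
        return 0
    return 1 if card_p1 > card_p2 else -1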
2.5 No-Limit Leduc Hold’em
A simple extension of Leduc Hold’em is to place no limit on the amount of chips which can be
raised by either player. Every time a player wants to re-raise their opponent's previous bet, they have to raise by at least double that amount. Due to this exponential growth of the raise sizes, players very quickly end up going all-in, wagering all their remaining chips.
No-Limit Leduc Hold’em is a much larger game than Leduc Hold’em due to the increase in the
number of actions at each decision point, from a small number of betting actions to a virtually
continuous space of possibilities.
2.6 Deep Counterfactual Value Networks
Deep neural networks (DNNs) are computing systems originally inspired by information pro-
cessing in the biological neural networks that constitute animal brains. DNNs have proven
capabilities such as modelling complex non-linear relationships, and as a result they are responsible for significant advancements in image and speech recognition [18] and in playing games.
Recently, DeepStack has used DNNs to estimate the value of future game states in HUNL, us-
ing millions of solutions to poker sub-games as training data. This was a key component of the
poker AI, enabling it to learn strategies that converge to a Nash equilibrium in HUNL without
using traditional abstraction techniques.
A common approach for solving large imperfect-information games (on the order of $10^{160}$ states) is to search for a solution in a smaller, abstracted version and map the strategies back to the original game. Such strategies are highly exploitable by opponents explicitly looking for sections of the full game tree that are weakly covered by the abstraction. DeepStack is a strong
poker AI that combines ideas from perfect information games with algorithms such as CFR
and endgame solving [2], while remaining theoretically sound [22]. Endgame solving is a
method for finding a solution with greater accuracy in a specific subgame encountered during
play that fits within the abstract equilibrium strategies that are already computed for the initial
portion of the game. Endgame solving alone comes with disadvantages, since the technique
is intractable, unless applied to small game trees, and requires action translation and dealing
with the off-tree actions of the opponent [2]. DeepStack proposes to solve these shortcomings
by continually re-solving the subgame at every action taken, starting from the current state.
Continual re-solving alone is still not enough to solve a game as big as HUNL, since computing a solution at the early stages of the game means exploring the entire tree, which is infeasible.
For this reason, DeepStack only computes a strategy down to a limited depth of the tree, and
uses deep neural networks to estimate expected counterfactual values for each hand on future
betting rounds in its re-solving step [17]. Consequently, the deep counterfactual value network (DCVN) is trained on randomly generated poker situations in order to predict the output of the CFR algorithm: counterfactual values for every node in the sub-tree.
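As an illustration of the shape of such a network, the sketch below defines a small fully connected model in PyTorch that maps a public-state description (here just the normalised pot size) and both players' ranges to a vector of counterfactual values for every hand. The sizes, depth and framework are my own placeholder choices and do not reproduce DeepStack's actual architecture, which also enforces the zero-sum property of the outputs.

import torch
import torch.nn as nn

class CounterfactualValueNet(nn.Module):
    def __init__(self, num_hands, hidden=256, depth=3):
        super().__init__()
        in_dim = 1 + 2 * num_hands            # pot size + both players' ranges
        out_dim = 2 * num_hands               # counterfactual value per hand, per player
        layers = []
        for i in range(depth):
            layers += [nn.Linear(in_dim if i == 0 else hidden, hidden), nn.ReLU()]
        layers.append(nn.Linear(hidden, out_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, pot, range_p1, range_p2):
        x = torch.cat([pot, range_p1, range_p2], dim=-1)
        return self.net(x)

# Training would regress the outputs against counterfactual values obtained by
# solving randomly generated subgames with CFR, e.g. with an MSE or Huber loss.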
Chapter 3
Evaluation
The previous chapter introduced a number of two-player zero-sum games of increasing complexity. This chapter presents the evaluation outcomes of the learning algorithms applied to these games. Strategy profiles are plotted against the number of iterations in the training process. In the small poker games, exploitability is plotted as a function of iterations.
3.1 Best Response
First introduced in Chapter 1, a Best Response (BR) is a strategy that maximizes an agent's utility against an adversarial strategy. When all players are playing a best response, their strategy profile forms a Nash equilibrium. For simple games such as RPS and RPS+, a BR can be calculated directly from the payoff matrix. In the RPS case, each player has to choose every action with probability 1/3. RPS+, on the other hand, has a Nash equilibrium profile when players adopt a strategy with probability 0.2 of playing Scissors and probability 0.4 each of playing Rock and Paper.
3.2 Experiments with BR in RPS and RPS+
I conducted experiments estimating the BR strategy profile using Regret Matching in the games of RPS and RPS+. The average strategy for each action is plotted against the number of iterations. The action profile starts evenly distributed, then quickly converges to a Nash equilibrium. The gray lines show the exact equilibrium values that the algorithm is converging to.
Figure 3.1: RPS with 1000 iterations
Figure 3.2: RPS with 10k iterations
Figure 3.3: RPS with 100k iterations
Figure 3.4: RPS with 1 million iterations
Figure 3.5: RPS+ with 1000 iterations
Figure 3.6: RPS+ with 10k iterations
Figure 3.7: RPS+ with 100k iterations
Figure 3.8: RPS+ with 1 million iterations
3.3 Experiments with CFR in Coin Toss
Coin Toss is a game which demonstrates that the strategy one player has to adopt in a subgame may depend on the expected value of a subgame in a part of the game tree that this player alone cannot reach. I used CFR to test this empirically and plotted the strategy profile of $P_2$ for an increasing number of iterations. Then, I switched the expected values of $P_1$'s "Sell" subgame (which depend on which side the coin lands) to show that $P_2$'s strategy profile has to switch the probability distribution over its actions as well.
Figure 3.9: CFR with 1000 iterations
Figure 3.10: CFR with 10k iterations
Figure 3.11: CFR with 100k iterations
Figure 3.12: CFR with 1 million iterations
Figure 3.13: CFR with 1 million iterations and reversed outcomes for $P_1$'s "Sell" subgame
3.4 Experiments with CFR, CFR+ and LCFR in Kuhn Poker
The CFR algorithm can be used to find a solution (a strategy profile that forms a Nash equilibrium) in the game of Kuhn Poker. At each iteration, the graphs plot the exploitability of the current solution in mbb/g (milli-big-blinds per game), a measure of how much a player could lose in expectation using that strategy. Out of the three variations, CFR+ was the quickest to learn a strategy, followed by LCFR and then CFR.
Figure 3.14: CFR
Figure 3.15: LCFR
Figure 3.16: CFR+
3.5 Experiments with CFR, CFR+ and LCFR in Leduc Hold’em
Leduc Hold'em is a larger game than Kuhn Poker, played with 6 cards instead of 3 and with an extra round of betting. Exploitability is plotted on a logarithmic axis against the number of iterations of the CFR variations. Once again, CFR+ was the quickest algorithm to learn a strategy, followed by LCFR.
Figure 3.17: CFR
Figure 3.18: LCFR
Figure 3.19: CFR+
Chapter 4
Conclusions and related work
This project begins with an introduction to game theory, imperfect-information games and poker. It then presents the learning algorithms that can be used to search for solutions in these games and explains what a good solution represents. It also describes the importance, as well as the applicability, of the presented learning methods in real-world cases, and investigates current state-of-the-art methods for learning in No-Limit games, a key discovery of the last few years which enables learning in much larger games than previously possible. The subsequent chapters introduce a number of notable games and present the empirical results of applying learning methods to search for solutions in them.
The key achievements of this project are the implementations of Regret Matching to find a Best Response strategy in RPS and RPS+, and of CFR (along with CFR+ and LCFR) to find a Nash equilibrium in the games of Coin Toss, Kuhn Poker and Leduc Hold'em.
Related work includes using various sampling techniques to approximate applying the CFR algorithm to the entire tree while only visiting a small subset of the branches. Moreover, new applications of deep learning to CFR [15] [26] present a different approach that merges classical game-theoretic algorithms with non-linear function approximation methods.
The planning of this project was divided into two parts. In the first part I read background material on state-of-the-art methods and built my understanding of the subject. In the second part I focused on replicating as much of the results as possible. Due to time constraints and the vast amount of reading required to understand the state of the art, only part of the learning methods have been replicated. The work can be extended by using Deep Counterfactual Value Networks to explore solutions in No-Limit betting games. Moreover, deep learning can be applied directly to CFR in order to abstract actions intrinsically and increase the scalability of the tree traversal.
Bibliography
[1] J. Bronowski. The Ascent of Man. Documentary, episode 13, 1973.
[2] S. Ganzfried and T. Sandholm. Endgame solving in large imperfect-information games. In Proceedings of the 14th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2015), 2015.
[3] K. Chaudhuri, Y. Freund, and D. J. Hsu. A parameter-free hedging algorithm. In Advances in Neural Information Processing Systems, pages 297–305, 2009.
[4] M. Lanctot, K. Waugh, M. Zinkevich, and M. Bowling. Monte Carlo sampling for regret minimization in extensive games. In Advances in Neural Information Processing Systems, 2009. URL http://mlanctot.info/files/papers/nips09mccfr.pdf.
[5] OpenAI. OpenAI Five. Blog post. URL https://openai.com/blog/openai-five/.
[6] Game theory course project page on John von Neumann. Stanford University, 1998–99. URL https://cs.stanford.edu/people/eroberts/courses/soco/projects/1998-99/game-theory/neumann.html.
[7] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.
[8] Wikipedia. Game complexity. URL https://en.wikipedia.org/wiki/Game_complexity.
[9] Hacker News discussion. URL https://news.ycombinator.com/item?id=13558170.
[10] M. Zinkevich, M. Johanson, M. Bowling, and C. Piccione. Regret minimization in games with incomplete information. In Advances in Neural Information Processing Systems, 2007. URL http://poker.cs.ualberta.ca/publications/nips07-cfr.pdf.
[11] N. Brown and T. Sandholm. Libratus: The superhuman AI for no-limit poker (demonstration).
[12] N. Brown and T. Sandholm. Safe and nested subgame solving for imperfect-information
games, 2017.
[13] N. Brown and T. Sandholm. Solving imperfect-information games via discounted regret
minimization, 2018.
[14] N. Brown and T. Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals, 2018.
[15] N. Brown, A. Lerer, S. Gross, and T. Sandholm. Deep counterfactual regret minimization,
2018.
[16] N. Brown, T. Sandholm, and B. Amos. Depth-limited solving for imperfect-information
games, 2018.
[17] P. Hopner and E. Loza Mencía. Analysis and optimization of deep counterfactual value networks. 2018. doi: 10.1007/978-3-030-00111-7_26.
[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114, 2012.
[19] H. W. Kuhn. Simplified two-person poker. In Harold W. Kuhn and Albert W. Tucker, editors, Contributions to the Theory of Games, volume 1, pages 97–103. Princeton University Press, 1950.
[20] D. N. L. Levy and M. Newborn. How Computers Play Chess. Ishi Press, 2009.
[21] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Ried-
miller. Playing Atari with Deep Reinforcement Learning. DeepMind, 2013.
[22] M. Moravcik, M. Schmid, N. Burch, V. Lisy, D. Morrill, N. Bard, T. Davis, K. Waugh, M. Johanson, and M. Bowling. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337):508–513, 2017. doi: 10.1126/science.aam6960.
[23] T. Sandholm. Solving imperfect-information games. Science, 347(6218):122–123, 2015. ISSN 0036-8075. doi: 10.1126/science.aaa4614. URL https://science.sciencemag.org/content/347/6218/122.
[24] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrit-
twieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham,
N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and
D. Hassabis. Mastering the game of Go with deep neural networks and tree search. Na-
ture, 529(7587):484–489, Jan. 2016. doi: 10.1038/nature16961.
[25] F. Southey, M. Bowling, B. Larson, C. Piccione, N. Burch, D. Billings, and C. Rayner. Bayes' Bluff: Opponent modelling in poker. In Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence (UAI), 2005.
[26] E. Steinberger. Single deep counterfactual regret minimization, 2019.
[27] O. Tammelin. Solving large imperfect information games using CFR+, 2014.
[28] O. Tammelin, N. Burch, M. Johanson, and M. Bowling. Solving heads-up limit Texas Hold'em. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages 645–652, 2015.