Get direct links to references, BibTeX extraction and comments on all arXiv papers with: Librarian
There are 4 main categories of games depending on whether the game ...
#### Heads-up no-limit Texas hold’em Common variant of poker playe...
In 1944 John von Neumann and Oskar Morgenstern co-authored a book e...
This is a massive number. Chess has approximately $10^{47}$ decisio...
**Counterfactual Regret Minimization (CFR)** is an algorithm by whi...
**mbb/g** - milli-big blinds per game. In poker, blinds are forced ...
In heads-up poken one player will post the big blind and their oppo...
### Nash Equilibrium Let’s say you were playing rock, paper scisso...
This figure provides a great high level picture of DeepStack. As th...
**Huber Loss** is a loss function that is less sensitive to outlier...
DeepStack: Expert-Level Artiﬁcial Intelligence in
Matej Morav
ˇ
c
´
ık
,,
, Martin Schmid
,,
, Neil Burch
, Viliam Lis
´
y
,
,
Dustin Morrill
, Nolan Bard
, Trevor Davis
,
Kevin Waugh
, Michael Johanson
, Michael Bowling
,
Department of Computing Science, University of Alberta,
Department of Applied Mathematics, Charles University,
Prague, Czech Republic
Department of Computer Science, FEE, Czech Technical University,
Prague, Czech Republic
These authors contributed equally to this work and are listed in alphabetical order.
To whom correspondence should be addressed; E-mail: bowling@cs.ualberta.ca
Artiﬁcial intelligence has seen several breakthroughs in recent years, with
games often serving as milestones. A common feature of these games is that
players have perfect information. Poker is the quintessential game of imperfect
information, and a longstanding challenge problem in artiﬁcial intelligence.
We introduce DeepStack, an algorithm for imperfect information settings. It
combines recursive reasoning to handle information asymmetry, decomposi-
tion to focus computation on the relevant decision, and a form of intuition that
is automatically learned from self-play using deep learning. In a study involv-
ing 44,000 hands of poker, DeepStack defeated with statistical signiﬁcance pro-
fessional poker players in heads-up no-limit Texas hold’em. The approach is
theoretically sound and is shown to produce more difﬁcult to exploit strategies
than prior approaches.
1
Games have long served as benchmarks and marked milestones of progress in artiﬁcial
intelligence (AI). In the last two decades, computer programs have reached a performance that
1
This is the author’s version of the work. It is posted here by permission of the AAAS for personal use, not for
redistribution. The deﬁnitive version was published in Science, (March 02, 2017), doi: 10.1126/science.aam6960.
1
exceeds expert human players in many games, e.g., backgammon (1), checkers (2), chess (3),
Jeopardy! (4), Atari video games (5), and go (6). These successes all involve games with
information symmetry, where all players have identical information about the current state of
the game. This property of perfect information is also at the heart of the algorithms that enabled
these successes, e.g., local search during play (7, 8).
The founder of modern game theory and computing pioneer, von Neumann, envisioned
reasoning in games without perfect information. “Real life is not like that. Real life consists of
blufﬁng, of little tactics of deception, of asking yourself what is the other man going to think
I mean to do. And that is what games are about in my theory. (9) One game that fascinated
von Neumann was poker, where players are dealt private cards and take turns making bets or
blufﬁng on holding the strongest hand, calling opponents’ bets, or folding and giving up on the
hand and the bets already added to the pot. Poker is a game of imperfect information, where
players’ private cards give them asymmetric information about the state of game.
Heads-up no-limit Texas hold’em (HUNL) is a two-player version of poker in which two
cards are initially dealt face-down to each player, and additional cards are dealt face-up in
three subsequent rounds. No limit is placed on the size of the bets although there is an overall
limit to the total amount wagered in each game (10). AI techniques have previously shown
success in the simpler game of heads-up limit Texas hold’em, where all bets are of a ﬁxed size
resulting in just under 10
14
decision points (11). By comparison, computers have exceeded
expert human performance in go (6), a perfect information game with approximately 10
170
decision points (12). The imperfect information game HUNL is comparable in size to go, with
the number of decision points exceeding 10
160
(13).
Imperfect information games require more complex reasoning than similarly sized perfect
information games. The correct decision at a particular moment depends upon the probability
distribution over private information that the opponent holds, which is revealed through their
past actions. However, how our opponent’s actions reveal that information depends upon their
knowledge of our private information and how our actions reveal it. This kind of recursive
reasoning is why one cannot easily reason about game situations in isolation, which is at the
heart of heuristic search methods for perfect information games. Competitive AI approaches
in imperfect information games typically reason about the entire game and produce a complete
strategy prior to play (14–16). Counterfactual regret minimization (CFR) (14,17,18) is one such
technique that uses self-play to do recursive reasoning through adapting its strategy against itself
over successive iterations. If the game is too large to be solved directly, the common response
is to solve a smaller, abstracted game. To play the original game, one translates situations and
actions from the original game to the abstract game.
Although this approach makes it feasible for programs to reason in a game like HUNL, it
does so by squeezing HUNL’s 10
160
situations down to the order of 10
14
abstract situations.
Likely as a result of this loss of information, such programs are behind expert human play. In
2015, the computer program Claudico lost to a team of professional poker players by a margin
of 91 mbb/g (19), which is a “huge margin of victory” (20). Furthermore, it has been recently
shown that abstraction-based programs from the Annual Computer Poker Competition have
2
massive ﬂaws (21). Four such programs (including top programs from the 2016 competition)
were evaluated using a local best-response technique that produces an approximate lower-bound
on how much a strategy can lose. All four abstraction-based programs are beatable by over
3,000 mbb/g, which is four times as large as simply folding each game.
DeepStack takes a fundamentally different approach. It continues to use the recursive rea-
soning of CFR to handle information asymmetry. However, it does not compute and store a
complete strategy prior to play and so has no need for explicit abstraction. Instead it consid-
ers each particular situation as it arises during play, but not in isolation. It avoids reasoning
about the entire remainder of the game by substituting the computation beyond a certain depth
with a fast approximate estimate. This estimate can be thought of as DeepStack’s intuition:
a gut feeling of the value of holding any possible private cards in any possible poker situa-
tion. Finally, DeepStack’s intuition, much like human intuition, needs to be trained. We train it
with deep learning (22) using examples generated from random poker situations. We show that
DeepStack is theoretically sound, produces strategies substantially more difﬁcult to exploit than
abstraction-based techniques, and defeats professional poker players at HUNL with statistical
signiﬁcance.
DeepStack
DeepStack is a general-purpose algorithm for a large class of sequential imperfect information
games. For clarity, we will describe its operation in the game of HUNL. The state of a poker
game can be split into the players’ private information, hands of two cards dealt face down, and
the public state, consisting of the cards laying face up on the table and the sequence of betting
actions made by the players. Possible sequences of public states in the game form a public tree
with every public state having an associated public subtree (Fig. 1).
A player’s strategy deﬁnes a probability distribution over valid actions for each decision
point, where a decision point is the combination of the public state and the hand for the acting
player. Given a player’s strategy, for any public state one can compute the player’s range,
which is the probability distribution over the player’s possible hands given that the public state
is reached.
Fixing both players’ strategies, the utility for a particular player at a terminal public state,
where the game has ended, is a bilinear function of both players’ ranges using a payoff matrix
determined by the rules of the game. The expected utility for a player at any other public
state, including the initial state, is the expected utility over reachable terminal states given the
players’ ﬁxed strategies. A best-response strategy is one that maximizes a player’s expected
utility against an opponent strategy. In two-player zero-sum games, like HUNL, a solution
or Nash equilibrium strategy (23) maximizes the expected utility when playing against a best-
response opponent strategy. The exploitability of a strategy is the difference in expected utility
against its best-response opponent and the expected utility under a Nash equilibrium.
The DeepStack algorithm seeks to compute and play a low-exploitability strategy for the
3
PRE-FLOP
FLOP
Raise
Raise
Fold
Call
Call
Fold
Call
Fold
Call
Call
Fold
Bet
Check
Check
Raise
Check
TURN
-50
-100
100
100
-200
Fold
1
22
1
1
2
2
1
Figure 1: A portion of the public tree in HUNL. Nodes represent public states, whereas edges
represent actions: red and turquoise showing player betting actions, and green representing
public cards revealed by chance. The game ends at terminal nodes, shown as a chip with an
associated value. For terminal nodes where no player folded, the player whose private cards
form a stronger poker hand receives the value of the state.
game, i.e., solve for an approximate Nash equilibrium. DeepStack computes this strategy dur-
ing play only for the states of the public tree that actually arise. Although computed during
play, DeepStack’s strategy is static, albeit stochastic, because it is the result of a deterministic
computation that produces a probability distribution over the available actions.
The DeepStack algorithm (Fig. 2) is composed of three ingredients: a sound local strategy
computation for the current public state, depth-limited lookahead using a learned value func-
tion to avoid reasoning to the end of the game, and a restricted set of lookahead actions. At a
conceptual level these three ingredients describe heuristic search, which is responsible for many
of AI’s successes in perfect information games. Until DeepStack, no theoretically sound appli-
cation of heuristic search was known in imperfect information games. The heart of heuristic
search methods is the idea of “continual re-searching”, where a sound local search procedure is
invoked whenever the agent must act without retaining any memory of how or why it acted to
reach the current state. At the heart of DeepStack is continual re-solving, a sound local strategy
computation which only needs minimal memory of how and why it acted to reach the current
public state.
Continual re-solving. Suppose we have taken actions according to a particular solution strat-
egy but then in some public state forget this strategy. Can we reconstruct a solution strategy
for the subtree without having to solve the entire game again? We can, through the process of
4
re-solving (17). We need to know both our range at the public state and a vector of expected
values achieved by the opponent under the previous solution for each opponent hand (24). With
these values, we can reconstruct a strategy for only the remainder of the game, which does not
increase our overall exploitability. Each value in the opponent’s vector is a counterfactual value,
a conditional “what-if value that gives the expected value if the opponent reaches the public
state with a particular hand. The CFR algorithm also uses counterfactual values, and if we use
CFR as our solver, it is easy to compute the vector of opponent counterfactual values at any
public state.
Re-solving, however, begins with a strategy, whereas our goal is to avoid ever maintaining
a strategy for the entire game. We get around this by doing continual re-solving: reconstructing
a strategy by re-solving every time we need to act; never using the strategy beyond our next
action. To be able to re-solve at any public state, we need only keep track of our own range
and a suitable vector of opponent counterfactual values. These values must be an upper bound
on the value the opponent can achieve with each hand in the current public state, while being
no larger than the value the opponent could achieve had they deviated from reaching the public
state. This is an important relaxation of the counterfactual values typically used in re-solving,
with a proof of sufﬁciency included in our proof of Theorem 1 below (10).
At the start of the game, our range is uniform and the opponent counterfactual values are
initialized to the value of being dealt each private hand. When it is our turn to act we re-solve the
subtree at the current public state using the stored range and opponent values, and act according
to the computed strategy, discarding the strategy before we act again. After each action, either
by a player or chance dealing cards, we update our range and opponent counterfactual values
according to the following rules: (i) Own action: replace the opponent counterfactual values
with those computed in the re-solved strategy for our chosen action. Update our own range using
the computed strategy and Bayes’ rule. (ii) Chance action: replace the opponent counterfactual
values with those computed for this chance action from the last re-solve. Update our own range
by zeroing hands in the range that are impossible given new public cards. (iii) Opponent action:
no change to our range or the opponent values are required.
These updates ensure the opponent counterfactual values satisfy our sufﬁcient conditions,
and the whole procedure produces arbitrarily close approximations of a Nash equilibrium (see
Theorem 1). Notice that continual re-solving never keeps track of the opponent’s range, instead
only keeping track of their counterfactual values. Furthermore, it never requires knowledge of
the opponent’s action to update these values, which is an important difference from traditional
re-solving. Both will prove key to making this algorithm efﬁcient and avoiding any need for the
translation step required with action abstraction methods (25, 26).
Continual re-solving is theoretically sound, but by itself impractical. While it does not ever
maintain a complete strategy, re-solving itself is intractable except near the end of the game.
In order to make continual re-solving practical, we need to limit the depth and breadth of the
re-solved subtree.
5
Limited depth lookahead via intuition. As in heuristic search for perfect information games,
we would like to limit the depth of the subtree we have to reason about when re-solving. How-
ever, in imperfect information games we cannot simply replace a subtree with a heuristic or
precomputed value. The counterfactual values at a public state are not ﬁxed, but depend on how
players play to reach the public state, i.e., the players’ ranges (17). When using an iterative
algorithm, such as CFR, to re-solve, these ranges change on each iteration of the solver.
DeepStack overcomes this challenge by replacing subtrees beyond a certain depth with a
learned counterfactual value function that approximates the resulting values if that public state
were to be solved with the current iteration’s ranges. The inputs to this function are the ranges
for both players, as well as the pot size and public cards, which are sufﬁcient to specify the
public state. The outputs are a vector for each player containing the counterfactual values of
holding each hand in that situation. In other words, the input is itself a description of a poker
game: the probability distribution of being dealt individual private hands, the stakes of the
game, and any public cards revealed; the output is an estimate of how valuable holding certain
cards would be in such a game. The value function is a sort of intuition, a fast estimate of the
value of ﬁnding oneself in an arbitrary poker situation. With a depth limit of four actions, this
approach reduces the size of the game for re-solving from 10
160
decision points at the start of
the game down to no more than 10
17
decision points. DeepStack uses a deep neural network as
its learned value function, which we describe later.
Sound reasoning. DeepStack’s depth-limited continual re-solving is sound. If DeepStack’s
intuition is “good” and “enough” computation is used in each re-solving step, then DeepStack
plays an arbitrarily close approximation to a Nash equilibrium.
Theorem 1 If the values returned by the value function used when the depth limit is reached
have error less than , and T iterations of CFR are used to re-solve, then the resulting strategy’s
exploitability is less than k
1
+ k
2
/
T , where k
1
and k
2
are game-speciﬁc constants. For the
proof, see (10).
Sparse lookahead trees. The ﬁnal ingredient in DeepStack is the reduction in the number of
actions considered so as to construct a sparse lookahead tree. DeepStack builds the lookahead
tree using only the actions fold (if valid), call, 2 or 3 bet actions, and all-in. This step voids
the soundness property of Theorem 1, but it allows DeepStack to play at conventional human
speeds. With sparse and depth-limited lookahead trees, the re-solved games have approximately
10
7
decision points, and are solved in under ﬁve seconds using a single NVIDIA GeForce GTX
1080 graphics card. We also use the sparse and depth-limited lookahead solver from the start of
the game to compute the opponent counterfactual values used to initialize DeepStack’s continual
re-solving.
Relationship to heuristic search in perfect information games. There are three key chal-
lenges that DeepStack overcomes to incorporate heuristic search ideas in imperfect information
6
C
Sampled poker
situations
BA
Agent's possible actions
Current public state
Agent’s range
Opponent counterfactual values
Neural net [see B]
Action history
Public tree
Subtree
Values
Ranges
Figure 2: DeepStack overview. (A) DeepStack reasons in the public tree always producing
action probabilities for all cards it can hold in a public state. It maintains two vectors while
it plays: its own range and its opponent’s counterfactual values. As the game proceeds, its
own range is updated via Bayes’ rule using its computed action probabilities after it takes an
action. Opponent counterfactual values are updated as discussed under “Continual re-solving”.
To compute action probabilities when it must act, it performs a re-solve using its range and the
opponent counterfactual values. To make the re-solve tractable it restricts the available actions
of the players and lookahead is limited to the end of the round. During the re-solve, counterfac-
tual values for public states beyond its lookahead are approximated using DeepStack’s learned
evaluation function. (B) The evaluation function is represented with a neural network that takes
the public state and ranges from the current iteration as input and outputs counterfactual values
for both players (Fig. 3). (C) The neural network is trained prior to play by generating ran-
dom poker situations (pot size, board cards, and ranges) and solving them to produce training
examples. Complete pseudocode can be found in Algorithm S1 (10).
games. First, sound re-solving of public states cannot be done without knowledge of how and
why the players acted to reach the public state. Instead, two additional vectors, the agent’s range
and opponent counterfactual values, must be maintained to be used in re-solving. Second, re-
solving is an iterative process that traverses the lookahead tree multiple times instead of just
once. Each iteration requires querying the evaluation function again with different ranges for
every public state beyond the depth limit. Third, the evaluation function needed when the depth
limit is reached is conceptually more complicated than in the perfect information setting. Rather
than returning a single value given a single state in the game, the counterfactual value function
needs to return a vector of values given the public state and the players’ ranges. Because of this
complexity, to learn such a value function we use deep learning, which has also been successful
at learning complex evaluation functions in perfect information games (6).
Relationship to abstraction-based approaches. Although DeepStack uses ideas from ab-
straction, it is fundamentally different from abstraction-based approaches. DeepStack restricts
the number of actions in its lookahead trees, much like action abstraction (25, 26). However,
7
each re-solve in DeepStack starts from the actual public state and so it always perfectly un-
derstands the current situation. The algorithm also never needs to use the opponent’s actual
action to obtain correct ranges or opponent counterfactual values, thereby avoiding translation
of opponent bets. We used hand clustering as inputs to our counterfactual value functions, much
like explicit card abstraction approaches (27, 28). However, our clustering is used to estimate
counterfactual values at the end of a lookahead tree rather than limiting what information the
player has about their cards when acting. We later show that these differences result in a strategy
substantially more difﬁcult to exploit.
Deep Counterfactual Value Networks
Deep neural networks have proven to be powerful models and are responsible for major ad-
vances in image and speech recognition (29, 30), automated generation of music (31), and
game-playing (5, 6). DeepStack uses deep neural networks with a tailor-made architecture, as
the value function for its depth-limited lookahead (Fig. 3). Two separate networks are trained:
one estimates the counterfactual values after the ﬁrst three public cards are dealt (ﬂop network),
the other after dealing the fourth public card (turn network). An auxiliary network for values
before any public cards are dealt is used to speed up the re-solving for early actions (10).
Architecture. DeepStack uses a standard feedforward network with seven fully connected
hidden layers each with 500 nodes and parametric rectiﬁed linear units (32) for the output. This
architecture is embedded in an outer network that forces the counterfactual values to satisfy
the zero-sum property. The outer computation takes the estimated counterfactual values, and
computes a weighted sum using the two players’ input ranges resulting in separate estimates of
the game value. These two values should sum to zero, but may not. Half the actual sum is then
subtracted from the two players’ estimated counterfactual values. This entire computation is
differentiable and can be trained with gradient descent. The network’s inputs are the pot size as
a fraction of the players’ total stacks and an encoding of the players’ ranges as a function of the
public cards. The ranges are encoded by clustering hands into 1,000 buckets, as in traditional
abstraction methods (27, 28, 33), and input as a vector of probabilities over the buckets. The
output of the network are vectors of counterfactual values for each player and hand, interpreted
as fractions of the pot size.
Training. The turn network was trained by solving 10 million randomly generated poker turn
games. These turn games used randomly generated ranges, public cards, and a random pot
size (10). The target counterfactual values for each training game were generated by solving
the game with players’ actions restricted to fold, call, a pot-sized bet, and an all-in bet, but no
card abstraction. The ﬂop network was trained similarly with 1 million randomly generated
ﬂop games. However, the target counterfactual values were computed using our depth-limited
8
Input
Bucket
ranges
7 Hidden Layers
• fully connected
• linear, PReLU
Output
Bucket
values
FEEDFORWARD
NEURAL NET
ZERO-SUM
NEURAL NET
Output
Counterfactual
values
CARD
COUNTERFACTUAL
VALUES
Zero-sum
Error
BUCKETING
(INVERSE)
BUCKETING
CARD
RANGES
500
500
500
500
500
500
500
1000
1
P2
P1
P1 P2
1326
P1
P2
Pot Public
1326
22100
1
1000
P2
P1
1000
P2
P1
Figure 3: Deep counterfactual value network. The inputs to the network are the pot size,
public cards, and the player ranges, which are ﬁrst processed into hand clusters. The output
from the seven fully connected hidden layers is post-processed to guarantee the values satisfy
the zero-sum constraint, and then mapped back into a vector of counterfactual values.
solving procedure and our trained turn network. The networks were trained using the Adam
gradient descent optimization procedure (34) with a Huber loss (35).
Evaluating DeepStack
We evaluated DeepStack by playing it against a pool of professional poker players recruited
by the International Federation of Poker (36). Thirty-three players from 17 countries were
recruited. Each was asked to complete a 3,000 game match over a period of four weeks between
November 7th and December 12th, 2016. Cash incentives were given to the top three performers
($5,000,$2,500, and $1,250 CAD). Evaluating performance in HUNL is challenging because of the large variance in per-game outcomes owing to randomly dealt cards and stochastic choices made by the players. The better player may lose in a short match simply because they were dealt weaker hands or their rare bluffs were made at inopportune times. As seen in the Claudico match (20), even 80,000 games may not be enough to statistically signiﬁcantly separate players whose skill differs by a consid- erable margin. We evaluate performance using AIVAT (37), a provably unbiased low-variance technique for evaluating performance in imperfect information games based on carefully con- structed control variates. AIVAT requires an estimated value of holding each hand in each public 9 Figure 4: Performance of professional poker players against DeepStack. Performance esti- mated with AIVAT along with a 95% conﬁdence interval. The solid bars at the bottom show the number of games the participant completed. state, and then uses the expected value changes that occur due to chance actions and actions of players with known strategies (i.e., DeepStack) to compute the control variate. DeepStack’s own value function estimate is perfectly suited for AIVAT. Indeed, when used with AIVAT we get an unbiased performance estimate with an impressive 85% reduction in standard deviation. Thanks to this technique, we can show statistical signiﬁcance (38) in matches with as few as 3,000 games. In total 44,852 games were played by the thirty-three players with 11 players completing the requested 3,000 games. Over all games played, DeepStack won 492 mbb/g. This is over 4 stan- dard deviations away from zero, and so, highly signiﬁcant. Note that professional poker players consider 50 mbb/g a sizable margin. Using AIVAT to evaluate performance, we see DeepStack was overall a bit lucky, with its estimated performance actually 486 mbb/g. However, as a lower variance estimate, this margin is over 20 standard deviations from zero. The performance of individual participants measured with AIVAT is summarized in Fig- ure 4. Amongst those players that completed the requested 3,000 games, DeepStack is esti- mated to be winning by 394 mbb/g, and individually beating 10 out of 11 such players by a statistically signiﬁcant margin. Only for the best performing player, still estimated to be losing by 70 mbb/g, is the result not statistically signiﬁcant. More details on the participants and their results are presented in (10). 10 Table 1: Exploitability bounds from Local Best Response. For all listed programs, the value reported is the largest estimated exploitability when applying LBR using a variety of different action sets. Table S2 gives a more complete presentation of these results (10). : LBR was unable to identify a positive lower bound for DeepStack’s exploitability. Program LBR (mbb/g) Hyperborean (2014) 4675 Slumbot (2016) 4020 Act1 (2016) 3302 Always Fold 750 DeepStack 0 Exploitability The main goal of DeepStack is to approximate Nash equilibrium play, i.e., minimize exploitabil- ity. While the exact exploitability of a HUNL poker strategy is intractable to compute, the re- cent local best-response technique (LBR) can provide a lower bound on a strategy’s exploitabil- ity (21) given full access to its action probabilities. LBR uses the action probabilities to compute the strategy’s range at any public state. Using this range it chooses its response action from a ﬁxed set using the assumption that no more bets will be placed for the remainder of the game. Thus it best-responds locally to the opponent’s actions, providing a lower bound on their overall exploitability. As already noted, abstraction-based programs from the Annual Computer Poker Competition are highly exploitable by LBR: four times more exploitable than folding every game (Table 1). However, even under a variety of settings, LBR fails to exploit DeepStack at all — itself losing by over 350 mbb/g to DeepStack (10). Either a more sophisticated lookahead is required to identify DeepStack’s weaknesses or it is substantially less exploitable. Discussion DeepStack defeated professional poker players at HUNL with statistical signiﬁcance (39), a game that is similarly sized to go, but with the added complexity of imperfect information. It achieves this goal with little domain knowledge and no training from expert human games. The implications go beyond being a milestone for artiﬁcial intelligence. DeepStack represents a paradigm shift in approximating solutions to large, sequential imperfect information games. Abstraction and ofﬂine computation of complete strategies has been the dominant approach for almost 20 years (33,40,41). DeepStack allows computation to be focused on speciﬁc situations that arise when making decisions and the use of automatically trained value functions. These are two of the core principles that have powered successes in perfect information games, albeit conceptually simpler to implement in those settings. As a result, the gap between the largest 11 perfect and imperfect information games to have been mastered is mostly closed. With many real world problems involving information asymmetry, DeepStack also has im- plications for seeing powerful AI applied more in settings that do not ﬁt the perfect information assumption. The abstraction paradigm for handling imperfect information has shown promise in applications like defending strategic resources (42) and robust decision making as needed for medical treatment recommendations (43). DeepStack’s continual re-solving paradigm will hopefully open up many more possibilities. References and Notes 1. G. Tesauro, Communications of the ACM 38, 58 (1995). 2. J. Schaeffer, R. Lake, P. Lu, M. Bryant, AI Magazine 17, 21 (1996). 3. M. Campbell, A. J. Hoane Jr., F. Hsu, Artiﬁcial Intelligence 134, 57 (2002). 4. D. Ferrucci, IBM Journal of Research and Development 56, 1:1 (2012). 5. V. Mnih, et al., Nature 518, 529 (2015). 6. D. Silver, et al., Nature 529, 484 (2016). 7. A. L. Samuel, IBM Journal of Research and Development 3, 210 (1959). 8. L. Kocsis, C. Szepesv ´ ari, Proceedings of the Seventeenth European Conference on Machine Learning (2006), pp. 282–293. 9. J. Bronowski, The ascent of man, Documentary (1973). Episode 13. 10. See Supplementary Materials. 11. In 2008, Polaris defeated a team of professional poker players in heads-up limit Texas hold’em (44). In 2015, Cepheus essentially solved the game (18). 12. V. L. Allis, Searching for solutions in games and artiﬁcial intelligence, Ph.D. thesis, Uni- versity of Limburg (1994). 13. M. Johanson, Measuring the size of large no-limit poker games, Technical Report TR13-01, Department of Computing Science, University of Alberta (2013). 14. M. Zinkevich, M. Johanson, M. Bowling, C. Piccione, Advances in Neural Information Processing Systems 20 (2008), pp. 905–912. 15. A. Gilpin, S. Hoda, J. Pe ˜ na, T. Sandholm, Proceedings of the Third International Workshop On Internet And Network Economics (2007), pp. 57–69. 12 16. End-game solving (17, 45, 46) is one exception to computation occurring prior to play. When the game nears the end, a new computation is invoked over the remainder of the game. Thus, the program need not store this part of the strategy or can use a ﬁner-grained abstraction aimed to improve the solution quality. We discuss this as re-solving when we introduce DeepStack’s technique of continual re-solving. 17. N. Burch, M. Johanson, M. Bowling, Proceedings of the Twenty-Eighth Conference on Artiﬁcial Intelligence (2014), pp. 602–608. 18. M. Bowling, N. Burch, M. Johanson, O. Tammelin, Science 347, 145 (2015). 19. We use milli-big-blinds per game (mbb/g) to measure performance in poker, where a milli- big-blind is one thousandth of the forced big blind bet amount that starts the game. This normalizes performance for the number of games played and the size of stakes. For com- parison, a win rate of 50 mbb/g is considered a sizable margin by professional players and 750 mbb/g is the rate that would be lost if a player folded each game. The poker community commonly uses big blinds per one hundred games (bb/100) to measure win rates, where 10 mbb/g equals 1 bb/100. 20. J. Wood, Doug Polk and team beat Claudico to win$100,000 from Microsoft & The Rivers
Casino, Pokerfuse, http://pokerfuse.com/news/media-and-software/26854-doug-polk-and-
team-beat-claudico-win-100000-microsoft/ (2015).
21. V. Lis
´
y, M. Bowling, Proceedings of the AAAI-17 Workshop on Computer Poker and Im-
perfect Information Games (2017). https://arxiv.org/abs/1612.07547.
22. DeepStack is not the ﬁrst application of deep learning to the game of poker. Previous ap-
plications of deep learning, though, are either not known to be theoretically sound (47), or
have only been applied in small games (48).
23. J. F. Nash, Proceedings of the National Academy of Sciences 36, 48 (1950).
24. When the previous solution is an approximation, rather than an exact Nash equilibrium,
sound re-solving needs the expected values from a best-response to the player’s previous
strategy. In practice, though, using expected values from the previous solution in self-play
often works as well or better than best-response values (10).
25. A. Gilpin, T. Sandholm, T. B. Sørensen, Proceedings of the Seventh International Confer-
ence on Autonomous Agents and Multi-Agent Systems (2008), pp. 911–918.
26. D. Schnizlein, M. Bowling, D. Szafron, Proceedings of the Twenty-First International Joint
Conference on Artiﬁcial Intelligence (2009), pp. 278–284.
27. A. Gilpin, T. Sandholm, T. B. Sørensen, Proceedings of the Twenty-Second Conference on
Artiﬁcial Intelligence (2007), pp. 50–57.
13
28. M. Johanson, N. Burch, R. Valenzano, M. Bowling, Proceedings of the Twelfth Interna-
tional Conference on Autonomous Agents and Multi-Agent Systems (2013), pp. 271–278.
29. A. Krizhevsky, I. Sutskever, G. E. Hinton, Advances in Neural Information Processing
Systems 25 (2012), pp. 1106–1114.
30. G. Hinton, et al., IEEE Signal Processing Magazine 29, 82 (2012).
31. A. van den Oord, et al., CoRR abs/1609.03499 (2016).
32. K. He, X. Zhang, S. Ren, J. Sun, Proceedings of the 2015 IEEE International Conference
on Computer Vision (ICCV) (2015), pp. 1026–1034.
33. J. Shi, M. L. Littman, Proceedings of the Second International Conference on Computers
and Games (2000), pp. 333–345.
34. D. P. Kingma, J. Ba, Proceedings of the Third International Conference on Learning Rep-
resentations (2014).
35. P. J. Huber, Annals of Mathematical Statistics 35, 73 (1964).
36. International Federation of Poker. http://pokerfed.org/about/ (Accessed January 1, 2017).
37. N. Burch, M. Schmid, M. Moravcik, M. Bowling, Proceedings of the AAAI-
17 Workshop on Computer Poker and Imperfect Information Games (2017).
http://arxiv.org/abs/1612.06915.
38. Statistical signiﬁcance where noted was established using a two-tailed Student’s t-test at
the 0.05 level with N 3000.
39. Subsequent to our study, the computer program Libratus, developed at CMU by Tuomas
Sandholm and Noam Brown, defeated a team of four professional heads-up poker spe-
cialists in a HUNL competition held January 11-30, 2017. Libratus has been described as
using “nested endgame solving” (49), a technique with similarities to continual re-solving,
but developed independently. Libratus employs this re-solving when close to the end of the
game rather than at every decision, while using an abstraction-based approach earlier in the
game.
40. D. Billings, et al., Proceedings of the Eighteenth International Joint Conference on Artiﬁ-
cial Intelligence (2003), pp. 661–668.
41. T. Sandholm, AI Magazine 31, 13 (2010).
42. V. Lisy, T. Davis, M. Bowling, Proceedings of the Thirtieth Conference on Artiﬁcial Intel-
ligence (2016), pp. 544–550.
14
43. K. Chen, M. Bowling, Advances in Neural Information Processing Systems 25 (2012), pp.
2078–2086.
44. J. Rehmeyer, N. Fox, R. Rico, Wired 16.12, 186 (2008).
45. S. Ganzfried, T. Sandholm, Proceedings of the Fourteenth International Conference on
Autonomous Agents and Multi-Agent Systems (2015), pp. 37–45.
46. M. Moravcik, M. Schmid, K. Ha, M. Hlad
´
ık, S. J. Gaukrodger, Proceedings of the Thirtieth
Conference on Artiﬁcial Intelligence (2016), pp. 572–578.
47. N. Yakovenko, L. Cao, C. Raffel, J. Fan, Proceedings of the Thirtieth Conference on Arti-
ﬁcial Intelligence (2016), pp. 360–367.
48. J. Heinrich, D. Silver, arXiv preprint arXiv:1603.01121 (2016).
49. N. Brown, T. Sandholm, Proceedings of the AAAI-17 Workshop on Computer Poker and
Imperfect Information Games (2017).
50. M. Zinkevich, M. Littman, Journal of the International Computer Games Association 29,
166 (2006). News item.
51. D. Morrill, Annual computer poker competition poker GUI client,
https://github.com/dmorrill10/acpc poker gui client/tree/v1.2 (2012).
52. O. Tammelin, N. Burch, M. Johanson, M. Bowling, Proceedings of the Twenty-Fourth In-
ternational Joint Conference on Artiﬁcial Intelligence (2015), pp. 645–652.
53. R. Collobert, K. Kavukcuoglu, C. Farabet, BigLearn, NIPS Workshop (2011).
54. S. Ganzfried, T. Sandholm, Proceedings of the Twenty-Eighth Conference on Artiﬁcial In-
telligence (2014), pp. 682–690.
Acknowledgements. The hand histories of all games played in the human study as well as
those used to generate the LBR results against DeepStack are in (10). We would like to thank
the IFP, all of the professional players who committed valuable time to play against DeepStack,
the anonymous reviewers’ invaluable feedback and suggestions, and R. Holte, A. Brown, and
K. Bla
ˇ
zkov
´
**Huber Loss** is a loss function that is less sensitive to outliers in data than the squared error loss. This is a massive number. Chess has approximately $10^{47}$ decision points. The estimated number of atoms in the universe is $10^{80}$ In heads-up poken one player will post the big blind and their opponent will post the small blind. If your strategy is to always fold, you will be loosing at a rate of 750 *mbb/g* (you are loosing *1.5* big blinds every 2 games - big blind + small blind). In 1944 John von Neumann and Oskar Morgenstern co-authored a book entitled **Theory of Games and Economic Behavior**. ![Theory of Games and Economic Behavior cover](https://upload.wikimedia.org/wikipedia/en/1/1e/VNM-TGEB.jpg) This was a groundbreaking book that would go on to essentially kickstart the field of game theory. John Nash would go on to greatly expand and explore the field of Game Theory. **mbb/g** - milli-big blinds per game. In poker, blinds are forced bets posted by players on each round. Usually there is a small blind and a big blind (twice the small blind). If the big blind is \$20 and you are averaging 500 *mbb/g*, your expected winnings are \$10 per hand. There are 4 main categories of games depending on whether the game is deterministic or not and on whether or not it contains perfect information. Deterministic games contain no element of chance and the outcome is solely determined by the skill and tactics of the players (e.g. chess). The distinction between perfect or imperfect information games depends upon whether the competitors have all the information relating to the game state. In this sense, Checkers is a perfect information game (both players can see everything that is going on the board) whereas Battleships is not. ![games categories](http://i.imgur.com/59biDnO.png) This figure provides a great high level picture of DeepStack. As the figure indicates all that DeepStack keeps track of is: - **Agent’s Range** - The probability distribution over the agent’s possible hands taking into account the current state of the game. - ** Opponent’s counterfactual values ** - Set of expected values for each hand the opponent might have given the current state. #### Heads-up no-limit Texas hold’em Common variant of poker played between two players with no limits on bets. In contrast, *heads-up **limit** Texas hold’em* "was essentially weakly solved” in 2015 ([paper](http://ai.cs.unibas.ch/_files/teaching/fs15/ki/material/ki02-poker.pdf)) Multi-player poker is still a considerably harder challenge than 1 vs. 1. ### Nash Equilibrium Let’s say you were playing rock, paper scissors with an opponent. The strategy that is Game Theory Optimal and therefore achieves a Nash equilibrium is to play each one of your 3 available actions with the same frequency ($1/3$). With 2 perfect players playing, no participant can gain by making a change to this strategy. It is worth noting that if your player is not a rational / perfect player, the Game Theory Optimal strategy might not be the one that yields the best results. If you are playing against an opponent that plays rock half of the times and scissors the other half, then the best strategy is to always play rock. **Counterfactual Regret Minimization (CFR)** is an algorithm by which an agent learns how to play a game by repeatedly playing against itself. The idea is that you initialize your strategy with a uniform distribution over the actions available to you. You then simulate playing the game against yourself repeatedly and each time, you revisit your strategy and update it. You improve your strategy by minimizing the **regret** for each action at each decision point. Regret is how much better you would have done overall (over all the games you’ve played) if at a certain decision point, you had always chosen a specific action available to you at the time (vs using the probability distribution over the possible actions provided by your strategy). You then go back and recompute your strategy with probabilities proportional to the regret for each action. [Here is a good resource](http://modelai.gettysburg.edu/2013/cfr/) in case you want to learn more about CFR.