%File: formatting-instruction.tex
\documentclass[letterpaper]{article}
\usepackage{aaai}
\usepackage{times}
\usepackage{helvet}
\usepackage{courier}
\usepackage{color}
\usepackage{textcomp}
\usepackage{amsthm}
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{graphicx}
\usepackage{xcolor}
\usepackage{comment}
\usepackage{xspace}
\usepackage{tikz}
\usepackage{epsdice}
\usetikzlibrary{trees}
\usetikzlibrary{shapes.geometric}
\tikzset{
treenode/.style = {align=center, inner sep=0pt, font=\sffamily},
ma/.style = {draw,treenode, shape border rotate=90, isosceles triangle,isosceles triangle apex angle=60, black, minimum width=2mm},% arbre rouge noir, noeud noir
mi/.style = {ma, shape border rotate=-90},
ch/.style = {draw, treenode, circle, minimum width=2mm, black}
}
\tikzstyle{level 1}=[level distance=8mm, sibling distance=1.5cm]
\tikzstyle{level 2}=[level distance=8mm, sibling distance=1cm]
\tikzstyle{level 3}=[level distance=8mm, sibling distance=0.5cm]
\usepackage[algo2e, noend, noline, linesnumbered]{algorithm2e}
% need these next two lines for old versions of alg2e
\providecommand{\SetAlgoLined}{\SetLine}
\providecommand{\DontPrintSemicolon}{\dontprintsemicolon}
\DontPrintSemicolon
\makeatletter
\newcommand{\pushline}{\Indp}% Indent
\newcommand{\popline}{\Indm}
\makeatother
\newcommand{\argmin}{\operatornamewithlimits{argmin}}
\newcommand{\argmax}{\operatornamewithlimits{argmax}}
\newcommand{\bE}{\mathbb{E}}
\newcommand{\bx}{\mathbf{x}}
\newcommand{\bg}{\mathbf{g}}
\newcommand{\bu}{\mathbf{u}}
\newcommand{\bU}{\mathbf{U}}
\newcommand{\cI}{\mathcal{I}}
\newcommand{\cC}{\mathcal{C}}
\newcommand{\cP}{\mathcal{P}}
\newcommand{\cQ}{\mathcal{Q}}
\newcommand{\tta}{\mathtt{a}}
\newcommand{\tth}{\mathtt{h}}
\newcommand{\ttz}{\mathtt{z}}
\newcommand{\PW}{\mbox{PW}}
\newcommand{\BR}{\mbox{BR}}
\newcommand{\defword}[1]{\textbf{\boldmath{#1}}}
\newcommand{\ie}{{\it i.e.}\xspace}
\newcommand{\eg}{{\it e.g.}\xspace}
\newtheorem{definition}{Definition}
\newtheorem{fact}{Fact}
\newtheorem{theorem}{Theorem}
\newtheorem{corollary}{Corollary}
\newtheorem{lemma}{Lemma}
\newtheorem{proposition}{Proposition}
\newcommand{\Proof}{{\noindent\bf Proof. }}
\newcommand{\citejustyear}[1]{\cite{#1}}
\newcommand{\Qed}{$\blacksquare$}
\newcommand{\abs}[1]{\left|#1\right|}
\newcommand{\todo}[1]{{\color{red}{\bf #1}}}
\newcommand{\breturn}{{\bf return}\xspace}
% the note center!
\definecolor{darkgreen}{RGB}{0,125,0}
\newcounter{vlNoteCounter}
\newcounter{mlNoteCounter}
\newcounter{asNoteCounter}
\newcommand{\vlnote}[1]{{\scriptsize \color{blue} $\blacksquare$ \refstepcounter{vlNoteCounter}\textsf{[VL]$_{\arabic{vlNoteCounter}}$:{#1}}}}
\newcommand{\mlnote}[1]{{\scriptsize \color{darkgreen} $\blacksquare$ \refstepcounter{mlNoteCounter}\textsf{[ML]$_{\arabic{mlNoteCounter}}$:{#1}}}}
\newcommand{\asnote}[1]{{\scriptsize \color{red} $\blacktriangle$ \refstepcounter{asNoteCounter}\textsf{[AS]$_{\arabic{asNoteCounter}}$:{#1}}}}
%\newcounter{NoteCounter}
%\newcommand{\vlnote}[1]{{\scriptsize \color{blue} \refstepcounter{vlNoteCounter}\textsf{[VL]$_{\arabic{vlNoteCounter}}$:{#1}}}}
%\renewcommand{\vlnote}[1]{}
\frenchspacing
\setlength{\pdfpagewidth}{8.5in}
\setlength{\pdfpageheight}{11in}
\pdfinfo{
/Title (Searching Imperfect Information Games using Online Counterfactual Regret Minimization)
/Author (Marc Lanctot, Viliam Lis\'{y}, and Michael Bowling)}
%\setcounter{secnumdept}{0}
\begin{document}
% The file aaai.sty is the style file for AAAI Press
% proceedings, working notes, and technical reports.
%
\title{Search in Imperfect Information Games using Online\\Monte Carlo Counterfactual Regret Minimization}
\author{Marc Lanctot$^1$, Viliam Lis\'{y}$^2$, and Michael Bowling$^3$\\
~~~~\\
$^1$Department of Knowledge Engineering, Maastricht University, The Netherlands\\
$^2$Department of Computer Science, FEE, Czech Technical University in Prague\\
$^3$Department of Computing Science, University of Alberta, Canada\\
{\small marc.lanctot@maastrichtuniversity.nl,~viliam.lisy@agents.fel.cvut.cz,~bowling@cs.ualberta.ca}
}
%$^2$Department of Computer Science, Czech Technical University in Prague\\
\maketitle
\begin{abstract}
Online search in games has always been a core interest of artificial intelligence.
Advances made in search for perfect information games (such as Chess, Checkers, Go,
and Backgammon) have led to AI capable of defeating the world's top human experts.
Search in imperfect information games (such as Poker, Bridge, and Skat) is significantly
more challenging due to the complexities introduced by hidden information. In this
paper, we present Online Outcome Sampling (OOS), the first imperfect information
search algorithm that is guaranteed to converge to an equilibrium strategy in two-player
zero-sum games. We show that OOS avoids common problems encountered by existing
search algorithms and we experimentally evaluate its convergence rate and practical
performance against benchmark strategies in Liar's Dice and a variant of Goofspiel.
We show that unlike with Information Set Monte Carlo Tree Search (ISMCTS) the
exploitability of the strategies produced by OOS decreases as the amount of search
time increases. In practice, OOS performs as well as ISMCTS in head-to-head play
while producing strategies with lower exploitability given the same search time.
\end{abstract}
\section{Introduction}
%Main points:
%\begin{itemize}
%\item motivate importance of convergence to NE.
%\item why online search vs. offline equilibrium computation
%\item existing methods may not converge over time (it would be great to have a motivating example that convinces the reader that any determinization-based II search algorithm cannot converge in general - some kind of search game could be a good example)
%\item introduce the first one that does, and show that it can compete with the others in practice
%\end{itemize}
In many sequential multi-agent interactions, agents have some initial time to prepare for the interaction and then, after each decision, have additional thinking time to decide on their next move. When preparation time is abundant and the computational resources are sufficient, an equilibrium strategy for a smaller abstract game can be pre-computed and then used during game play. This {\it offline approach} has been remarkably successful in
Computer Poker \cite{Sandholm10The,Rubin11Poker,Gilpin09,Johanson07Msc}.
However, the preparation time is often very limited. The exact model of the game may become known only shortly before acting is necessary, such as in general game playing, security enforcement in a previously unknown environment, and general-purpose robotics. In a short time, only a very small abstract game could be solved in advance. Moreover, it might not even be possible to create a sufficiently small and still useful abstraction that can be solved in time. In these cases, agents may need to {\it decide online}: make initial decisions quickly and then put additional effort into improving their strategy in the current situation while the interaction is taking place.
%Search in game-playing has been a classic interest of artificial intelligence.
%focused on perfect information games,
%When restricted to perfect information settings, the field of online search has benefited from decades of research in game-playing.
%as evidenced by events such as defeat of human chess champions,
%solving of checkers, and the growing strength of computer Go programs~\cite{Campbell02deepblue,Schaeffer07gameover,Gelly12}.
%Due to the complexity and problems introduced by hidden information, search in imperfect information settings has received comparatively much less attention.
%This is unfortunate as many sequential decision-making problems have some element of hidden information.
%Modern approaches to imperfect information search to sample possible ``world'' states, then run
%searches rooted from these states, aggregating their results to choose an action to play. The most extreme forms
%\eg Perfect Information
%Monte Carlo~\cite{Long10Understanding}
%``average over clairvoyance''~\cite{AIBook},
% ML-Note Nov2: @Viliam, PIMC *is* "averaging over clairvoyance"
%ignoring the information structure completely. While this is sufficient in some games, it tends to have trouble in games like poker
%where the information structure is critical.
%; similar problems have been reported using Monte Carlo Tree Search (MCTS) in Kriegspiel~\cite{Ciancarini10Kriegspiel}.
%Most recent algorithms try to fix the problem by accounting for the information structure during the
%search in some way. These approaches seem to improve performance in Kriegspiel, Skat~\cite{Furtak13Recursive},
%Scotland Yard~\cite{Nijssen12SY}, Dou Di Zhu, Magic: The Gathering, and other card games~\cite{Cowling12MTG,Cowling12ISMCTS}.
%For example, the notion of a subgame is required for backward induction,
%which most perfect information search algorithms are based on, is not well-defined in imperfect information games.
%We present results in two domains: Liar's Dice and a variant of Goofspiel.
%PIMC~\cite{Long10Understanding}, ISMCTS~\cite{Cowling12ISMCTS}, and MMCTS~\cite{Auger11Multiple}.
%\vlnote{Current intro targets particularly the search community.}
%\section{Background and Related Work}
%Overview of past algorithms and the current state-of-the-art (we have to ``weave'' the related work below into a nice overview).
%Define determinization, strategy fusion, non-locality, disambiguation factor.
%References to help:
%\begin{itemize}
%\item Original/first II search~\cite{Frank98Finding}
%\item Work on Bridge~\cite{Ginsberg01}
%\item Information Set Search~(ISS)~\cite{Parker10iss,Parker06paranoia}
%%\vlnote{\cite{Parker10iss} is a little more detailed, but I guess it is from a less reputable forum.}
%We have used ISS in MCTS setting in a visibility-based pursuit-evasion game~\cite{Lisy12peg}
%\item Work on Hearts / Spades~\cite{Sturtevant08An}
%\item PIMC~\cite{Long10Understanding} and other work on Skat (Jeff's thesis)
%\item Counter-examples in simple games~\cite{Shafiei09,Ponsen11Computing}
%\item MMCTS (Phantom Tic-Tac-Toe)~\cite{Auger11Multiple}
%\item Work on Exp3 for Urban rivals~\cite{Teytaud11Upper,StPierre12Online}
%\item IS-MCTS~\cite{Cowling12ISMCTS} and other work by that group (on Dou di Zhu, MTG etc. \cite{Whitehouse11DDZ,Cowling12MTG})
%\item IIMC~\cite{Furtak13Recursive}
%\item Scotland Yard~\cite{Nijssen12SY}
%\item MCTS in Kriegspiel~\cite{Ciancarini10Kriegspiel}
%\item Kurt's past work on MCTS for Poker~\cite{vdbroek09MCTSPoker}?
%%\item MCTS in Kriegspiel~\cite{Ciancarini09Kriegspiel}
%%\vlnote{why not the AIJ 2010 paper?}
%% ML: I did not know about it.
%\item Maybe quick mention of recent excitment in simultaneous move games (SMAB, double-oracle and serialized AB algorithm that doesn't have a name, SM-MCTS, and NIPS paper
%\item Mention that OOS has been applied in SM-MCTS setting.
%\end{itemize}
% classic work: PIMC and its successes
Some of the first imperfect information search algorithms were inspired by the application to
Bridge~\cite{Frank98Finding,Ginsberg96Partition}, and notably by the success of GIB~\cite{Ginsberg01}.
In GIB, perfect information search is performed on a {\it determinized}
instance of the game: one in which all normally hidden information is visible to every player, a concept
originally proposed in~\cite{Levy89Heuristic}. This approach has also performed well in other
games like Scrabble~\cite{Sheppard02World}, Hearts~\cite{Sturtevant08An}, and Skat~\cite{Buro09Improving}.
% overview of problems with PIMC, proposed solutions
There are several problems that have been identified with Perfect Information Monte Carlo (PIMC) search techniques
on determinized samples. The most common are {\it strategy fusion} and {\it non-locality}~\cite{Frank98Finding}.
Strategy fusion occurs when an algorithm is allowed to choose two different actions from two different states in the same
information set, which is inconsistent with the constraints imposed by the game's information structure.
Non-locality arises from the assumption that subgames are well-defined and hence that search can be
applied recursively by computing and comparing the values of subgames.
Strategy fusion can be overcome by imposing the proper information constraints during search, as was done in
Kriegspiel~\cite{Ciancarini10Kriegspiel}, Kuhn Poker~\cite{Ponsen11Computing}, pursuit-evasion games~\cite{Lisy12peg},
card games and other phantom games~\cite{Cowling12ISMCTS}.
Non-locality, however, is a fundamental problem: it invalidates the usual minimax formulation, in which optimal payoffs are defined strictly over
children subtrees.
There is empirical evidence suggesting that some of these techniques will not converge to the optimal solution even in very
small games like Biased Rock-Paper-Scissors and Kuhn poker~\cite{Shafiei09,Ponsen11Computing}.
%The authors of \cite{Parker10iss,Parker06Overconfidence} showed that minimax-style backups at opponent information sets can lead to overly paranoid search.
%We also make similar observations for ISMCTS in Liar's Dice.
% other missing related works
%One particularly related work is the MMCTS algorithm~\cite{Auger11Multiple} that has been
%applied to Phantom Tic-Tac-Toe.
%MMCTS uses Exp3 at each information set in its search, and scales the payoff
%received by the depth of each playout.
%While we have not implemented MMCTS, we have been able to test our
%algorithm against it thanks to the authors sharing their code.
% Poker and the game-theoretic offline approach
%Billings04Game
Game-tree search techniques have also been applied to Poker~\cite{Billings04Game}. In recent years, however,
the most common approach to producing strong Poker AI has been to use advances in equilibrium approximation
algorithms to compute a near-optimal equilibrium of an abstract game and subsequently use the abstract strategies
when playing the full game~\cite{Sandholm10The}. We refer to this as the {\it offline approach} because it
requires designing domain-specific abstractions and precomputing approximate equilibria offline in advance, which could take
several weeks or even months. In the offline approach, during actual play an agent simply looks up its precomputed strategy
stored in a large table.
This offline approach is often not possible. For example, an agent may be required to
act shortly after discovering the rules, such as in general game playing, which leaves very little time for precomputation of
equilibria or designing of domain-specific abstractions.
Also, due to the size of the abstractions required for strong play~\cite{Johanson13Evaluating}, there may not be enough space to
store the precomputed strategy, such as in gaming apps on mobile devices.
%The main problem with current techniques is a lack of theoretical foundations of the algorithms. There are no guarantees that the algorithms will always find a good strategy in a game, given a sufficient thinking time. In fact,
In this paper, we focus on the {\it online} setting. We propose a Monte Carlo Tree Search (MCTS)~\cite{mctssurvey} algorithm that
uses Monte Carlo Counterfactual Regret Minimization as its basis rather than Upper Confidence Bounds~\cite{UCB,UCT}.
%We start by formalizing the problem of imperfect information search from the ground up. That is, we return to the game-theoretic
%foundations upon which search in games is built to motivate the importance of Nash equilibrium as the optimal strategy in zero-sum games. We explain the subtleties introduced by imperfect information that cause the current approaches not to converge to the optimal strategy.
We introduce Online Outcome Sampling (OOS), a simulation-based algorithm that builds its search tree incrementally,
like MCTS.
We show that OOS converges to an equilibrium strategy as search time per move increases, making it the first
known imperfect information search algorithm satisfying this property, which we call {\it consistency}.
%thermore, we analyze methods for incorporating the information obtained by the players in the game to the search process, so that the consistency property is preserved.
We show the empirical convergence rates and performance of OOS, comparing them to benchmark players and to ISMCTS~\cite{Cowling12ISMCTS}, a
recent algorithm currently used in a popular mobile phone implementation of Spades~\cite{Whitehouse13Integrating}.
%\vlnote{I do not understand the point of the last two sentences.}
% MarcL: I'm not really sure why you didn't understand, but I clarified the argument by expanding the entire paragrap. Email me if it's still a problem.
\subsection{Extensive-Form Games}
%Basic game theoretic definitions. Behavioral strategies.
%CFR~\cite{CFR} and MCCFR~\cite{Lanctot09Sampling}. Success in Poker.
Here, we define the relevant game-theoretic terminology that forms the basis
of our analysis. The notation used here is based on~\cite{OsbRub94}.
%We then describe
%the fundamental concepts required to describe OOS.
%For a comprehensive introduction and
%survey of the fundamental topics, see~\cite{ShoLB08}.
In this paper, we focus on two-player zero-sum extensive form games, which model sequential decision making of agents called \defword{players} denoted
$i \in N = \{ 1, 2 \}$. In turn, players choose \defword{actions} leading to sequences called \defword{histories} $h \in H$.
A history $z \in Z$, where $Z \subseteq H$, is called a \defword{terminal history} and represents a full game from start to end.
At each terminal history $z$ there is a payoff $u_i(z)$ in $[-\Delta,\Delta]$
to each player $i$. At each nonterminal history $h$, there is a single
current player to act, determined by $P: H \backslash Z \rightarrow N \cup \{ c \}$ where $c$ is a special player called \defword{chance}
(sometimes also called nature) that plays with a fixed stochastic strategy. For example, chance is used to represent rolls of dice
and card draws. The game starts in the empty history $\emptyset$, and
at each step, given the current history $h$, the current player chooses an action $a \in A(h)$ leading to successor history $h' = ha$;
in this case we call $h$ a \defword{prefix} of $h'$ and denote this relationship by $h \sqsubset h'$. Also, for all $h,h',h'' \in H$,
if $h \sqsubset h'$ and $h' \sqsubset h''$ then $h \sqsubset h''$, and $h \sqsubset h$. Each of the sets $N$, $H$, $Z$, and $A(h)$ for every $h \in H$ is
finite, and every history has finite length.
Define $\cI = \{ \cI_i~|~i \in N \}$ the set of information partitions.
$\cI_i$ is a partition over $H_i = \{ h~|~P(h) = i \}$ where each part is called an \defword{information set}.
Intuitively, an information set
$I \in \cI_i$ that belongs to player $i$ represents a state of the game with respect to what player $i$ knows.
Formally, $I$ is a set of histories that a player cannot tell apart (due to information hidden from that player). For all
$h,h' \in I$, $A(h) = A(h')$ and $P(h) = P(h')$; hence, often we naturally extend the definition to $A(I)$, $P(I)$, and denote $I(h)$ the information set
containing $h$.
% ML-Note Nov2: I don't this we'll need this in this paper, but just in case...
%We also define the \defword{choice set} of (information set, action) pairs for one player to be
%$Q_i = \{ (I,a) \mid I \in \cI_i, a \in A(I) \} \cup \{ q_{\emptyset} \}$, where $q_{\emptyset}$ is
%the empty(root) choice.
%For a history $h \in H$, define
%$X_i(h) = (I,a), (I', a'), \cdots~$ to be the sequence of player $i$'s (information set,
%action) pairs (choices) that were encountered and taken to reach $h$ in the same order as they are encountered
%and taken along $h$. In this paper, every extensive-form game has \defword{perfect recall}, which means
%$\forall i \in N, \forall I \in \cI_i : h, h' \in I \Rightarrow X_i(h) = X_i(h')$. Intuitively,
%this means that player $i$ does not forget any information that they discovered during their play
%up to $h$.
%Denote $succ_i(I,a)$ the set of successor choices of player $i$, that is
%all $(I',a')$ such that $X_i(h') = X_i(h),~(I',a')$ where $h \in I, h' \in I'$.
%equilibrium definitions
A \defword{behavioral strategy} for player $i$ is a function mapping each information set $I \in \cI_i$
to a probability distribution over the actions $A(I)$, denoted by $\sigma_i(I)$.
%If every distribution in the range of this mapping assigns all of its weight on a single action, then the strategy is called \defword{pure}.
%A \defword{mixed} strategy is a single explicit distribution over pure strategies.
Given a profile $\sigma$, we denote the probability of reaching a terminal history $z$ under $\sigma$ as
$\pi^\sigma(z) = \prod_{i \in N \cup \{c\}} \pi^\sigma_i(z)$, where each $\pi^\sigma_i(z) = \prod_{ha \sqsubset z, P(h) = i} \sigma_i(I(h),a)$
is the product of the probabilities of the actions taken by player $i$ along $z$.
We also use $\pi^{\sigma}_i(h,z)$ and $\pi^{\sigma}(h,z)$ to refer to the product
of only the probabilities of actions along the sequence from the end of $h$ to the end of $z$, where $h \sqsubset z$.
Define $\Sigma_i$ to be the set of behavioral strategies for player $i$.
As is convention, $\sigma_{-i}$ and $\pi_{-i}^\sigma$ refer to player $i$'s opponent strategy and products of the opponent's and chance's actions.
An \defword{$\epsilon$-Nash equilibrium}, $\sigma$, is a set of $\sigma_i$ for $i \in N$ such that the benefit of switching to some
alternative $\sigma_i'$ is limited by $\epsilon$, i.e.,
\begin{equation}
\label{eq:ne}
\max_{\sigma_i' \in \Sigma_i} u_i(\sigma_i', \sigma_{-i}) - u_i(\sigma) \le \epsilon
\end{equation}
holds for each player $i \in N$. When $\epsilon = 0$, the profile is simply called a Nash equilibrium.
When $|N| = 2$ and $u_1(z) + u_2(z) = k$ for all $z \in Z$, the game is a two-player zero-sum (constant-sum) game. These
games form an important subset of extensive-form games due to their worst-case guarantees: every equilibrium strategy yields
the same expected payoff against any opponent equilibrium strategy, and at least that payoff against any other opponent strategy. In this paper, we define the \defword{exploitability} of a profile as the sum of both differences from Eq.~\ref{eq:ne},
$\epsilon_{\sigma} = \max_{\sigma_1' \in \Sigma_1} u_1(\sigma_1', \sigma_2) + \max_{\sigma_2' \in \Sigma_2} u_2(\sigma_1, \sigma_2')$.
\subsection{Counterfactual Regret Minimization}
Counterfactual Regret (CFR) is a notion of regret at the information set level for extensive-form games~\cite{CFR}.
The CFR algorithm iteratively learns strategies in self-play, converging to an equilibrium.
The \defword{counterfactual value} of reaching information set $I$ is the expected payoff given that player $i$ played to reach $I$, the opponents played
$\sigma_{-i}$ and both players played $\sigma$ after $I$ was reached:
\begin{equation}
\label{eq:cfv}
v_i(I,\sigma) = \sum_{(h,z) \in Z_I} \pi^{\sigma}_{-i}(h) \pi^{\sigma}(h,z) u_i(z),
\end{equation}
where $Z_I = \{ (h,z)~|~z \in Z, h \in I, h \sqsubset z \}$.
Suppose, at time $t$, player $i$ plays with strategy $\sigma^t_i$.
Define $\sigma^t_{I \rightarrow a}$ as identical to $\sigma^t$ except at $I$ action $a$ is taken with probability $1$.
The counterfactual regret of not taking $a \in A(I)$ at time $t$ is $r^t(I,a) = v_i(I,\sigma^t_{I \rightarrow a}) - v_i(I,\sigma^t)$.
The algorithm maintains the cumulative regret $R^T(I,a) = \sum_{t=1}^T r^t(I,a)$, for every action at every information set of every player.
Then, the distribution at each information set for the next iteration $\sigma^{T+1}(I)$ is obtained individually using
regret-matching~\cite{Hart00}. The distribution is proportional to the positive portion of the individual actions' regret:
\begin{equation*}
\label{eq:rm}
\sigma^{T+1}(I,a) = \left\{
\begin{array}{ll}
R^{T,+}(I,a) / R^{T,+}_{sum}(I) & \mbox{if } R^{T,+}_{sum}(I) > 0 \\
1 / |A(I)| & \mbox{otherwise,}
\end{array} \right.
\end{equation*}
where $x^+ = \max(0,x)$ for any term $x$, and $R^{T,+}_{sum}(I) = \sum_{a' \in A(I)} R^{T,+}(I,a')$. Furthermore, the algorithm maintains for each information set the average strategy profile
\begin{equation}
%\bar{\sigma}^T(I,a) = \frac{1}{T}\sum_{t=1}^T \sigma^t(I,a).
\bar{\sigma}^T(I,a) = \frac{\sum_{t=1}^T \pi^{\sigma^t}_i(I) \sigma^t(I,a)}{\sum_{t=1}^T \pi^{\sigma^t}_i(I)},
\end{equation}
where $\pi^{\sigma^t}_i(I) = \sum_{h \in I}\pi^{\sigma^t}_i(h)$.
The combination of the counterfactual regret minimizers in individual information sets also minimizes the overall average regret \cite{CFR}, and hence the average profile is a $2\epsilon$-equilibrium, with $\epsilon \rightarrow 0$
as $T \rightarrow \infty$.
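As a concrete illustration of these two updates, the following minimal Python sketch shows regret matching and average-strategy accumulation for a single information set; the table names \texttt{regret} and \texttt{avg\_strategy} are ours and purely illustrative, not part of any reference implementation.
\begin{verbatim}
# Regret matching at one information set (illustrative sketch).
def regret_matching(regret):
    """sigma^{T+1}(I): proportional to positive cumulative regret."""
    positive = [max(r, 0.0) for r in regret]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    return [1.0 / len(regret)] * len(regret)  # uniform if no positive regret

def accumulate_average(avg_strategy, sigma, reach_i):
    """Add pi_i^{sigma^t}(I) * sigma^t(I, a) for each action a."""
    for a, p in enumerate(sigma):
        avg_strategy[a] += reach_i * p
\end{verbatim}
Normalizing \texttt{avg\_strategy} over its actions at the end of a run recovers $\bar{\sigma}^T(I)$, since each $\sigma^t(I)$ sums to one.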
Monte Carlo Counterfactual Regret Minimization (MCCFR) applies CFR to sampled portions of the games~\cite{Lanctot09Sampling}.
In the \defword{outcome sampling} (OS) variant of the algorithm, a single terminal history $z\in Z$ is sampled in each iteration.
The algorithm updates the regret in the information sets visited along $z$ using the
\defword{sampled counterfactual value},
\begin{equation*}
\tilde{v}_i(I,\sigma) = \left\{
\begin{array}{ll}
\frac{1}{q(z)} \pi^{\sigma}_{-i}(h) \pi^{\sigma}(h,z) u_i(z) & \mbox{if } (h,z) \in Z_I\\
0 & \mbox{otherwise,}
\end{array} \right.
\end{equation*}
where $q(z)$ is the probability of sampling $z$.
As long as every $z \in Z$ has a non-zero probability of being sampled, $\tilde{v}_i(I,\sigma)$ is an unbiased estimate of $v_i(I,\sigma)$
due to the importance sampling correction ($1/q(z)$). For this reason, applying CFR updates using these sampled counterfactual values
at the sampled information sets also eventually converges to an approximate equilibrium of the game with high probability.
The required number of iterations for convergence is much larger, but each iteration is much faster.
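To see why the importance-sampling correction yields an unbiased estimate, take the expectation over the sampled terminal history $z$ and note that the sampling probability cancels:
\begin{equation*}
\bE_{z \sim q}\!\left[\tilde{v}_i(I,\sigma)\right]
 = \sum_{(h,z) \in Z_I} q(z) \cdot \frac{1}{q(z)}\, \pi^{\sigma}_{-i}(h)\, \pi^{\sigma}(h,z)\, u_i(z)
 = v_i(I,\sigma).
\end{equation*}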
In Poker, CFR and MCCFR have been used with remarkable success as offline methods
for pre-computing approximate equilibria in abstract games~\cite{CFR,Johanson12CFRBR}; the same general
approach has also been used in Liar's Dice~\cite{Neller11,Lanctot12IR}.
%additional optimizations that can be applied in Liar's Dice~\cite{Neller11}.
%Essentially, CFR takes time (weeks?) {\it pre-computing}
%approximate equilibria on abstract poker games, which are then used to look up actions when a decision needs to be made during play.
\subsection{Subgames and Online Search}
%Complexities of online search vs. offline equilibrium computation. Notions of subgame perfection (sequential equilibrium).
%Perfect infromation games. Subject perfection
%Minimax and MCTS are online approximations of backward induction.
%Explain why this is not the case in imperfect information search.
%Show what can go wrong if you use a perfect information search method.
In a \defword{match} (online game), each player is allowed little or no preparation time before playing (preventing the offline precomputation of approximate equilibrium solutions).
There is a current \defword{match history}, $\tth$, initially the empty history $\emptyset$ representing the start of the match. Each turn,
the agent controlling $P(\tth)$ is given $t$ time units to decide on a \defword{match action} $\tta \in A(\tth)$ and the
match history then changes using $\tth \leftarrow \tth \tta$. There is a single referee who knows $\tth$, samples chance outcomes
as needed from $\sigma_c(\tth)$, and reveals $I(\tth)$ to $P(\tth)$ on their turn. The players play until the match reaches a terminal history $\ttz$,
giving each player $i$ a payoff of $u_i(\ttz)$.
% Sort of reptitive from non-locality discussion from before
A perfect information game can be broken down into subgames and solved independently.
%An equilibrium is called subgame perfect if its portion used in every subgame is also an equilibrium in that subgame.
Every perfect information game has a pure subgame perfect equilibrium, which can be found by backward induction.
Search algorithms simply aim to identify the single optimal action at the search tree's root.
Imperfect information games cannot be easily broken down into subgames. In general, the optimal strategies in an inner
information set could be mixed and depend on the probabilities that individual nodes in the information set are
reached. These probabilities depend on the strategies of the players in the tree above the information set, which in
turn depend on the payoffs obtained from other parts of the tree.
%Hence, the ``subgames'' cannot be solved independently
%and the search problem is more complex.
Some previous work has been done for accelerating equilibrium computation by solving or approximately solving subgames, particularly
in end-game situations~\cite{Gilpin06Competitive,Gilpin07Better,Ganzfried13Endgame}. These techniques tend to help in practice,
but as shown in \cite{Ganzfried13Endgame}, non-locality prevents these methods from producing equilibrium strategies in general.
Using CFR to solve decomposed subgames has also recently been investigated~\cite{Burch14Solving}. Whether this decomposition
could be used in Monte Carlo simulations is an interesting topic of future work.
\subsection{Information Set Monte Carlo Tree Search}
The implementation details of ISMCTS described in \cite{Cowling12ISMCTS} affect the efficiency of the method in various games, but in essence, the multi-observer variant of the algorithm incrementally builds the game tree similarly to MCTS, placing a multi-armed bandit algorithm at each information set rather than at each state. After a non-root information set is reached during the match, further simulations are run from a node chosen from this information set uniformly at random. This prevents the algorithm from converging to the optimal solution of the game.
Consider the game in Figure~\ref{fig:coordGame}.
Suppose ISMCTS searches from information set $I$. Because the utilities of actions taken from both states are combined, ISMCTS will choose the left and right actions equally often. However, mixing uniformly at $I$ is not part of an equilibrium in this game: it would give the maximizing player an expected utility of $\frac{3}{4}$ for playing right, creating an incentive for the maximizer to always play left; in that case, the minimizer would be better off playing right, reaching an expected reward of only $0.5$.
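These numbers can be verified directly from Figure~\ref{fig:coordGame}. Writing $u_{\bigtriangleup}(\cdot)$ for the maximizer's expected payoff when the minimizer mixes uniformly at $I$,
\begin{align*}
u_{\bigtriangleup}(\mbox{right}) &= \tfrac{1}{2}\cdot 0 + \tfrac{1}{2}\left(\tfrac{1}{2}\cdot 0 + \tfrac{1}{2}\cdot 3\right) = \tfrac{3}{4},\\
u_{\bigtriangleup}(\mbox{left}) &= \tfrac{1}{2}\cdot 1 + \tfrac{1}{2}\left(\tfrac{1}{2}\cdot 3 + \tfrac{1}{2}\cdot 0\right) = \tfrac{5}{4},
\end{align*}
so the maximizer deviates to always playing left; against that deviation, the minimizer's best response (always playing right) drives the value down to $\tfrac{1}{2}\cdot 1 + \tfrac{1}{2}\cdot 0 = \tfrac{1}{2}$, well below the game value of $1.0$.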
\begin{figure}
\begin{center}
\begin{tikzpicture}
\node [ch] {}
child{ node [ma] (i1_1) {}
child{ node {1}}
child{ node {0}}
edge from parent node[left] {0.5}
}
child{ node [ma] (i1_2) {}
child{ node [mi] (i2_1) {}
child{ node {3}}
child{ node {0}}
}
child{ node [mi] (i2_2) {}
child{ node {0}}
child{ node {3}}
}
edge from parent node[right] {0.5}
};
\draw [dashed] (i1_1) -- (i1_2);
\draw [dashed] (i2_1) -- (i2_2) node[midway, above] {I};
\end{tikzpicture}
\end{center}
\caption{An example game with maximizing $\bigtriangleup$, minimizing $\bigtriangledown$ and chance $\bigcirc$ players. The optimal (Nash equilibrium) strategy is for $\bigtriangleup$ to play (0.5,0.5) and for $\bigtriangledown$ to play (left,right) with probabilities ($\frac{1}{3}$,$\frac{2}{3}$). Its value is $1.0$.}\label{fig:coordGame}
\end{figure}
\section{Online Outcome Sampling}
%Information set targeted search. Epistemic depth.
%Convergence theorems. Algorithm description.
%Recall outcome sampling (OS) from Section~\ref{sec:cfr}.
When outcome sampling is used in the offline setting, data structures for all information sets are allocated and created
before the first iteration starts. In each iteration, every information set that is sampled gets updated.
% MarcL: these are too strong and kind of miss the point, at least the way they were phrased..
%In the online setting, this is not practical. The limited computation time often does not allow even a single visit of each
%information set in the game tree. Moreover, the information sets far from the tree root will likely have only very few visits,
%making the statistics stored in these information sets practically useless.
We make two essential modifications to adapt outcome sampling to the online search setting.
{\bf Incremental Game Tree Building.} Before the match begins, only the very first (root) information set is added to memory.
In each iteration, at most one information set is added to the information set tree kept in memory.
In particular, when an information set is reached that is not in memory, it is added to memory and then a default
playout policy (\eg uniform random) takes over for the remainder of the simulation.
Along the playout portion (tail) of the simulation, information sets are neither added to memory nor updated.
Along the tree portion (head) of the simulation, information sets are updated as usual.
This way, only the information sets with relevant statistics are stored in memory.
%We call this modification \defword{playout-based outcome sampling}
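A minimal sketch of this bookkeeping, in Python-style code with illustrative names (\texttt{memory}, \texttt{ensure\_in\_memory}, and \texttt{playout\_action} are ours, not part of the algorithm's specification):
\begin{verbatim}
import random

# Information sets are added to memory only on their first visit; from that
# point on, a default uniform-random playout finishes the simulation without
# adding or updating any further information sets.
memory = {}  # infoset key -> {"regret": [...], "avg": [...]}

def ensure_in_memory(infoset_key, num_actions):
    """Return (node, created): the node for this infoset, created if new."""
    created = infoset_key not in memory
    if created:
        memory[infoset_key] = {"regret": [0.0] * num_actions,
                               "avg":    [0.0] * num_actions}
    return memory[infoset_key], created

def playout_action(num_actions):
    """Default playout policy: uniform random over legal actions."""
    return random.randrange(num_actions)
\end{verbatim}
During a simulation, the caller switches from the in-tree sampling policy to \texttt{playout\_action} as soon as \texttt{created} is true, and performs regret and average-strategy updates only at the information sets visited before that point.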
{\bf In-Match Search Targeting.}
%ime per move, the match can after a few moves easily
%reach a history $\tth$ in an information set that was visited in only a few iterations or even not present in the memory at all.
%The strategy suggested by the algorithm in such an information set can be completely arbitrary.
Suppose several moves have been played since the start of the match leading to $\tth$.
Plain outcome sampling would continue to sample from the root of the game (not the current match history $\tth$), entirely
disregarding the region of the game space that the match has headed toward.
Hence, the second modification we propose is directing the search towards the histories that are more likely to occur during the match currently played.
%Online Outcome Sampling (OOS) builds on playout-based OS by proposing two different ways to direct the search when starting with a particular non-empty match history, $\tth$.
Note that this history is typically not fully known to the players, who only know their own information sets along it.
Furthermore, unlike ISMCTS, OOS always runs its samples from the root of the game tree, even with a non-empty match history.
We now describe two specific targeting methods.
\subsection{Information Set Targeting (IST)}
Suppose the match history is $\tth$. IST samples histories reaching the current information set ($I(\tth)$),
i.e., $(h,z) \in Z_{I(\tth)}$, with higher probability than other histories.
The intuition is that these histories are particularly
relevant since the searching player {\it knows} that one of these $z$ will describe the match at its completion.
However, focusing fully only on these histories may cause problems, since convergence guarantees are lost.
Consider again the game in Figure~\ref{fig:coordGame}.
If the minimizing player knows it is in information set $I$ and focuses all of its search on this information set for sufficiently long, it computes the uniform strategy, which is optimal in the right (coordination) part of the game.
However, if the minimizing player plays uniformly, the maximizing player prefers to switch to always playing the left action, increasing its payoff in the case where the coordination game is not played. Any fixed non-zero probability of sampling the left chance action
eventually solves this problem: regrets are multiplied by the reciprocal of the sampling probability, so rare samples influence the strategy
in an information set proportionally more strongly.
Note that previous methods, such as PIMC and ISMCTS, {\it always} target $I(\tth)$, \ie with probability 1, and do not
update predecessors of $I(\tth)$. In contrast, in IST {\it all} information sets in memory reached during an iteration require updating
to ensure eventual convergence to an equilibrium.
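To make the role of this correction explicit with symbols introduced only for this remark, let $q_1(z)$ and $q_2(z)$ be the probabilities of reaching a terminal history $z$ under the targeted and untargeted sampling schemes, respectively. Since the targeted scenario is chosen with probability $\delta$, the overall probability of sampling $z$ is
\begin{equation*}
q(z) = \delta\, q_1(z) + (1-\delta)\, q_2(z),
\end{equation*}
and dividing each update by $q(z)$, as in the sampled counterfactual values, keeps the regret estimates unbiased even though targeted histories are sampled much more often than the rest.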
%What IST does is similar to what previous methods, such as PIMC and ISMCTS, do. However, there some important subtleties
%It is tempting to set $\delta_I$ high. The problem, however, is that
%focusing too much on $Z_{I(\tth)}$ could
%have the effect of revealing private information to the opponent.
%\vlnote{I agree this holds for Exp3-based MMCTS-like approaches, but I do not think it is true for MCCFR. Can you create a simple counter-example as above?} To see this, imagine the subgame
%defined by placing a single chance node over the histories $h \in I(\tth)$, whose distribution is obtained by previous chance event outcomes.
%The equilibrium strategy for the opponent in this subgame plays as if the opponent knows the searching player's private information. Therefore,
%we expect that the optimal value of $\delta_I$ will depend on the importance of the hidden information.
\subsection{Public Subgame Targeting (PST)}
A \defword{public action} is an action in the ``public tree'' defined in \cite{12aamas-pcs}. Informally, an action is public if it is observable by all players (e.g., bids in Liar's Dice and bets in Poker are public). Formally, an action $a$ is public iff
$\forall i, \forall I \in \cI_i, \forall h_1,h_2\in I: a\in h_1 \Leftrightarrow a\in h_2$.
For example, the extensive-form version of Rock, Paper, Scissors has two information
sets, $I_1 = \{ \emptyset \}$ and $I_2 = \{ r, p, s \}$; it has no public actions, because each history in
$I_2$ contains a single distinct action (the unobserved one taken by the first player).
%a sequence of legal actions from I to I', S(I,I'), is said to be a ``public action sequence'' if every two histories h,h' \in I: hL \in I' and h'L \in I' and S(I,I') describes a one-to-one correspondence between histories in I and histories in I'. A public action, a \in A(I), is one such that every sequence containing a as the first action, S(I,I') = a S(I'',I') where I' is a successor of I is a public action sequence.
Given a history $h$, let $p(h)$ be the sequence of public actions along $h$ in the same order that they were taken in $h$.
Define the \defword{public subgame} induced by $I$ to be the one whose terminal history set is
\[Z_{p,I(h)} = \{(h',z)~|~z \in Z, h' \in H, p(h') = p(h), h' \sqsubset z \}.\]
Now, suppose the match history is $\tth$.
Public subgame targeting samples $z \in Z_{p,I(\tth)}$ with higher probability than terminal histories outside this set.
A public subgame thus contains all terminal histories consistent with the sequence of public actions played over the match, under
every combination of private chance events for both players. For example, in two-player limit Texas Hold'em poker, suppose
the first player bets, the second player calls, and then the flop is revealed. At this point, the public actions are: bet, call.
The public subgame described by $Z_{p,I(h)}$ contains every terminal history (including every combination of private chance
outcomes for all players) with at least two public actions whose first two public actions are: bet, call.
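As a minimal illustration of this membership test, suppose each history is represented as a list of (action, is\_public) pairs; this representation and the function names below are ours, chosen only for the sketch.
\begin{verbatim}
# A history h' belongs to the public subgame induced by the match history
# iff its sequence of public actions equals p(match history).
def public_actions(history):
    """p(h): the public actions of h, in the order they were taken."""
    return [a for (a, is_public) in history if is_public]

def in_public_subgame(candidate, match_history):
    return public_actions(candidate) == public_actions(match_history)
\end{verbatim}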
%Suppose $\delta_{p,\tth}$ is the probability that any $z$ in this set is sampled.
%Again, it is tempting to sample from this set with $\delta_{p,\tth} = 1$.
%However, this is also problematic because the probabilities to reach $\tth$ play a critical role in the convergence in the full game.
%Nonetheless, the intuition is to spend more time in parts of the tree that are relevant given the progression of the match. Unlike
%information set targeting, public subgame targeting should not reveal anything about private information.
\subsection{Algorithm}
\begin{algorithm2e}[t]
OOS$(h, \pi_i, \pi_{-i}, s_1, s_2, i)$: \;
\pushline
\If{$h \in Z$}{
\breturn $(1, \delta s_1 + (1-\delta)s_2, u_i(z))$\; \label{alg:terminal}
}
\ElsIf{$P(h) = c$}{
Sample an outcome $a$ and let $\rho_1,\rho_2$ be its probability in targeted and untargeted setting \; \label{alg:chancesample}
\breturn OOS$(ha, \pi_i, \rho_2 \pi_{-i}, \rho_1 s_1 , \rho_2 s_2, i)$ \;
}
$I \gets $ getInfoset$(h, P(h))$ \;
Let $(a,s_1',s_2') \leftarrow $ Sample$(h, I, i, \epsilon)$ \; \label{alg:sample}
\If{$I$ is not in memory}{
Add $I$ to memory \;
$\sigma(I) \leftarrow \mbox{Unif}(A(I))$ \;
$(x, l, u) \gets $ Playout$(ha, \frac{\delta s_1 + (1-\delta)s_2}{|A(I)|})$ \; \label{alg:playout}
}
\Else{
$\sigma(I) \gets $ RegretMatching$(r_I)$ \;
$\pi_{P(h)}' \gets \sigma(I,a)\pi_{P(h)}$ \; \label{alg:newreaches1}
$\pi_{-P(h)}' \gets \pi_{-P(h)}$ \; \label{alg:newreaches2}
% $(\pi_i', \pi_{-i}') \leftarrow$ NewReachProbs$(h, \pi_i, \pi_{-i}, i)$ \; \label{alg:newreaches}
$(x, l, u) \gets$ OOS$(ha, \pi_i', \pi_{-i}', s_1', s_2', i)$ \;
}
$c \gets x$ \; \label{alg:suffix1}
$x \gets x \sigma(I,a)$ \; \label{alg:suffix2}
\For{$a' \in A(I)$}{
\If{$P(h) = i$}{
$W \gets u \pi_{-i}~/~l$ \;
\If{$a' = a$}{
$r_I[a'] \gets r_I[a'] + (c - x)W$ \;
}
\Else{
$r_I[a'] \gets r_I[a'] - xW$ \;
}
}
\Else{
$s_I[a'] \gets s_I[a'] + \frac{1}{\delta s_1 + (1-\delta)s_2} \pi_{-i} \sigma(I,a')$ \; \label{alg:avgstrat}
}
}
\breturn $(x, l, u)$ \; \label{alg:returnend}
\popline
\vspace{0.1cm}
\caption{Online Outcome Sampling. \label{alg}}
\end{algorithm2e}
The algorithm is iterative and samples a single trajectory from the root $\emptyset$ to some
terminal history. At each information set in memory, $I$, there are two tables maintained: $r_I$ stores cumulative
regret for each action $a \in A(I)$, and $s_I$ stores the cumulative average strategy probability of each
action.
Depending on the targeting method that is chosen (IST or PST), $Z_{sub}$ is one of $Z_{I(\tth)}$ or $Z_{p,I(\tth)}$.
The pseudo-code is presented as Algorithm~\ref{alg}.
Each iteration is represented by two calls of OOS where the update player $i \in \{1,2\}$ is alternated.
Before each iteration, a {\it scenario} is decided:
with probability $\delta$ the iteration targets the subgame and chooses $z \in Z_{sub}$
and with probability $(1-\delta)$ the usual OS sampling determines $z \in Z$.
The first parameter of OOS is the current history.
The next two are the reach probabilities of the current strategies for the update player $i$ and the opponent.
The fourth and fifth parameters are the overall probabilities that the current sample is generated, one for each scenario: first the targeted ($s_1$), then the untargeted ($s_2$).
The last is the update player. Initial calls have the form OOS$(\emptyset, 1, 1, 1, 1, i)$.
For the return values, $x$ is a suffix/tail reach probability for both players,
$l$ is the root-to-leaf sample probability, and $u$ is the payoff of the trajectory in view
of the update player.
In outcome sampling, an $\epsilon$-on-policy sampling distribution used at each information set
is defined as
\begin{equation*}
\label{eq:ossample}
\Phi(I,i) = \left\{
\begin{array}{ll}
\epsilon \cdot \mbox{Unif}(A(I)) + (1-\epsilon)\sigma_i(I) & \mbox{if } P(I) = i\\
\sigma_i(I) & \mbox{otherwise,}
\end{array} \right.
\end{equation*}
and denote $\Phi(I,i,a)$ the probability of sampling $a \in A(I)$.
The sampling of chance's choices on line~\ref{alg:chancesample} depends on the targeting method and the scenario being used. For example, in the targeted scenario with information set targeting, the sampled outcome must be consistent with the match history.
A critical part of the algorithm is the choice of action and the sample-reach updates on line~\ref{alg:sample}. If $I$ is not in memory, an action is sampled uniformly. Otherwise, in the targeted scenario, the current history $h$ is always in the targeted part of the game, and an action from $\{a~|~\exists z\in Z: (ha,z)\in Z_{sub}\}$ is selected using the distribution $\Phi(I(h),i)$ normalized over this subset of actions. Defining $sum=\sum_{a:\,\exists z,\,(ha,z)\in Z_{sub}}\Phi(I,i,a)$, we set $s_1' = s_1\Phi(I,i,a)/sum$. In the untargeted scenario, any action $a \sim \Phi(I,i)$ can be sampled. If the action does not leave the targeted part of the game (i.e., $\exists z: (ha,z)\in Z_{sub}$), then $s_1' = s_1\Phi(I,i,a)/sum$; otherwise $s_1'=0$. In all cases $s_2' = \Phi(I,i,a) s_2$.
These sample reach probabilities are combined into one at a terminal history on line~\ref{alg:terminal},
start of the playout on line~\ref{alg:playout}, and when updating the average
strategy on line~\ref{alg:avgstrat}.
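The following Python-style sketch mirrors this description of the sampling step for an information set already in memory; \texttt{phi} stands for the distribution $\Phi(I,i,\cdot)$ and \texttt{in\_target(a)} for the test $\exists z: (ha,z)\in Z_{sub}$, both illustrative assumptions of the sketch rather than a fixed interface.
\begin{verbatim}
import random

def sample_action(phi, in_target, targeted_scenario, s1, s2):
    actions = list(range(len(phi)))
    # "sum" from the text: total Phi-mass of actions staying in Z_sub.
    norm = sum(phi[a] for a in actions if in_target(a))
    if targeted_scenario:
        # Restrict to actions that stay in the targeted part, renormalized.
        allowed = [a for a in actions if in_target(a)]
        a = random.choices(allowed, [phi[b] / norm for b in allowed])[0]
        s1_new = s1 * phi[a] / norm
    else:
        a = random.choices(actions, phi)[0]
        s1_new = s1 * phi[a] / norm if in_target(a) else 0.0
    s2_new = s2 * phi[a]
    return a, s1_new, s2_new
\end{verbatim}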
%In practice, the algorithm does not sample $z \in Z_{sub}$ directly, but rather on a per-action basis.
%There are two stages, the prefix stage ($h \sqsubset \tth$), and the tail stage (either $\tth \sqsubseteq h$ or $h$ is
%off the match path).
%In the prefix stage, $\cP_{sub}$ is a distribution
%that assigns some high probability $\gamma$ to sampling an action that is consistent with $\tth$. In the tail case, the
%sampling distribution is $\epsilon$-on-policy, so action $a \in A(I)$ is sample with probability
%$\epsilon/|A(I)| + (1-\epsilon)\sigma(I,a)$. Also, no exploration is done on opponent histories, so $\epsilon = 0$ when
%$P(h) = -i$.
The playout on line~\ref{alg:playout} samples to the end of the game with some playout policy at each step; we use uniform random,
but in general an informed policy based on domain knowledge could be used as well.
Unlike MCTS, the playout policy in OOS must compute $l$ when reaching a terminal and update the tail probability $x$ when returning
as done on line~\ref{alg:suffix2}. Lines~\ref{alg:newreaches1} and \ref{alg:newreaches2} modify $P(h)$'s reach probability
by multiplying it by $\sigma(I,a)$, keeping the other value the same.
%Lines~\ref{alg:suffix1} to \ref{alg:returnend} % second ref does not seem to work
Lines~\ref{alg:suffix1} to 24
contain the usual outcome sampling updates. Note that regrets are updated at the
update player's histories, while average-strategy tables are updated at the opponent's histories.
%\mlnote{Still need to add some more detail to
%earlier sections so that this last part is not too confusing.}
\begin{theorem}
Let $\bar{\sigma}^t_m(\delta,\tth)$ be a strategy produced by OOS with scheme $m \in \{ \mbox{IST}, \mbox{PST} \}$
using $\delta < 1$ started from $\tth$ run for $t$ iterations, with exploration $\epsilon > 0$.
For any $p \in (0, 1]$ and $\varepsilon > 0$, there exists $t < \infty$ such that with
probability $1-p$ the strategy $\bar{\sigma}^t_m(\delta,\tth)$ is an $\varepsilon$-equilibrium strategy.
\label{thm:consistency}
\end{theorem}
Since every terminal history has non-zero probability of being sampled, eventually every information
set will be contained in memory. Then, the algorithm becomes MCCFR with a non-uniform sampling scheme.
Consequently, by \cite[Theorem 5]{Lanctot09Sampling}, OOS minimizes external regret with high probability.
Note that, due to non-locality, this consistency property cannot hold in general for any search
algorithm that does not modify $\sigma(I(h))$ at preceding histories $h \sqsubset \tth$. However,
it is an open question whether any of the previous algorithms could be modified to ensure
consistency.
\section{Empirical Evaluation}
%\mlnote{This section is still quite rough. I've commented out much of the text for now.}
%Here are some ides for what we could look at empirically:
%\begin{itemize}
%\item Compare sampling techniques in each domain
%\item Quality of OOS as a function of the length of the match history $\tth$ (and time limits?)
%\item Robustness/sensitivity of $\delta_I$ and $\delta_{I,\tth}$ with observed convergence,
%\item Importance of ``retaining the tree'' from previous searches along the same match $z$ (suspected to be high)
%\item Performance comparisons (win rates and exploitability) between sampling techniques and some state-of-the-art
%algorithms in imperfect information search (ISMCTS, PIMC, MMCTS),
%possibly also show exploitation / exploitability trade-offs as in RNash work
%\item (probably too much for this one paper) What if there are a sequence of multiple matches and we can reuse computation from
%previous matches?
%\item How does the story change with or without \{ incremental tree-building, targeting, retaining memory between searches \}
%\end{itemize}
%
% How does the story change:
% - with/without targeting (as a function of delta)
% - with/without incremental tree-building
% - with/without remembering between moves
We now compare the head-to-head performance and exploitability
of OOS and ISMCTS on two games. % with different sources of imperfect information.
%We show the effects of targeting methods and $\delta$ values.
%We start by describing each game, whose utilities are in $\{-1, 0, 1\}$.
Liar's Dice, LD($D_1$,$D_2$), also known as Dudo, Perudo, and Bluff, is a dice-bidding game.
Each die has six sides with faces \epsdice{1} to \epsdice{5} and a star $\star$.
Each player $i$ rolls $D_i$ of these dice and looks at them without showing them to their opponent.
Each round, players alternate by bidding on the outcome of all dice in play until one player ``calls liar'',
\ie claims that their opponent's latest bid does not hold.
A bid consists of a quantity of dice and a face value.
A face of $\star$ is considered wild and counts as matching any other face.
To bid, the player must increase either the quantity or face value of the current
bid (or both). The losing player discards a number of dice equal to how many dice were missing
for the last bid to be valid. The players continue until one player has no dice left.
Imperfect Information Goofspiel, II-Goofspiel($N$), is a two-player card game where each player is
given a private hand of bid cards with values $1$ to $N$. A different
deck of $N$ point cards is placed face up in a stack.
On their turn, each player bids for the top point card by
choosing a single card in their hand.
The highest bidder gets the point card and adds its value to their score; the points are discarded
in the case of a tie.
This is repeated $N$ times and the winner is the player with the highest score.
In II-Goofspiel, the players only discover who won or lost a bid, but not which bid cards were played.
Also, we assume the point stack is strictly increasing: $1, 2, \ldots, N$.
%Phantom Tic-Tac-Toe (PTTT) is played on a 3-by-3 board,
%initially empty, where the goal is to claim three squares along the same row, column, or diagonal.
%However, in PTTT, players' actions are private.
%Each turn, a player attempts to take a square of their choice.
%If they fail due to the opponent having taken that square on a previous turn, the same player
%keeps trying to take an alternative square until they succeed.
%Players are not informed about how many attempts the opponent made before succeeding.
%The game ends immediately if there is ever a connecting line of squares belonging to the same player.
%The winner receives a payoff of $+1$, while the losing player receives $-1$.
We focus our experiments on LD(1,1) and II-Goofspiel(6). While these games are considered small by
search algorithm standards, their size makes it possible to compute best response strategies and measure exploitability,
allowing us to show the observed convergence of the strategies
produced by OOS. We also include preliminary results in LD(2,2).
To improve performance against irrational play, we use a more explorative regret matching,
$\sigma^{T+1}_\gamma(I,a) = \gamma/|A(I)| + (1-\gamma) \sigma^{T+1}(I,a)$, with $\gamma = 0.01$.
While this could affect convergence, we observe in our experiments that exploitability
decreases as search time increases.
\subsection{Head-to-Head Performance versus Exploitability}
\begin{figure}[t!]
\begin{center}
\includegraphics[scale=0.55]{plots/ismcts-oos-perf} \\
\includegraphics[scale=0.55]{plots/eq-oos-perf} \\
\includegraphics[scale=0.55]{plots/oos-expl} \\
\includegraphics[scale=0.55]{plots/goof-expl} \\
\caption{Results for OOS in LD(1,1). From top: win rate (\%) of OOS vs. ISMCTS, win rate (\%) of
OOS vs. a 0.001-equilibrium, and approximate $\epsilon_{\sigma}$ using the aggregate method in LD(1,1)
and II-Goofspiel(6). }
\label{fig:results}
\end{center}
\end{figure}
%Here are some first results for performance and/or exploitability analysis of the algorithms.
%Note that unless otherwise stated, memory is retained and not cleared between successive searches
%(which has no effect in ISMCTS).
%\mlnote{I have played against the ISMCTS strategy and I believe it's converging to a particular
%pure (and bad) strategy.}
%We evaluate head-to-head performance by simulating several games of each algorithm against the other
%and counting wins and losses.
In games like Poker and Liar's Dice, it is often critical to play in such a way that the opponent
cannot easily infer the private information. This explains partly why CFR-based methods have
enjoyed so much success in the offline approach.
In the online setting, however, since the tree is built incrementally, only partial strategies
are produced.
We are unaware of any methods for assessing the worst-case exploitability of strategies
produced by an online search algorithm.
We therefore propose two new methods to approximate the exploitability of the produced strategies.
In the offline setting, measuring exploitability can be done by a recursive walk of the game
tree using expectimax. In online search, the algorithm computes only
a partial strategy.
The first \defword{full stitching} method
enumerates each $I \in \cI_i$ in topologically-sorted order starting at the root,
%tries to capture as closely as possible the distributions that would be computed
%for each $I$ if they were reached in a match:
running a search from $I$ that reuses only the information computed in previous searches from ancestors of $I$, saving the
distribution computed at $I$, and passing the state of memory down only to the children of $I$.
We do not save changes made to ancestors when searching at $I$, to ensure
a fair comparison between OOS and ISMCTS. Full stitching provides the best representation of the full strategy
that OOS would eventually produce, since it builds the distribution at each information set in the
same way as OOS would if that set were reached during play.
However, full stitching requires $|\cI|$ searches and a corresponding amount of memory, which is impractical in large games.
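As an illustration, the following minimal Python sketch outlines full stitching; the helpers \texttt{search}, \texttt{parent}, and the memory representation are hypothetical placeholders rather than parts of our actual implementation:
{\small
\begin{verbatim}
import copy

# Minimal sketch of full stitching. `infosets` is topologically
# sorted, `parent(I)` returns the closest ancestor information
# set of I (None at the root), and `search(I, memory)` runs the
# online search from I, returning the distribution at I and the
# updated memory.
def full_stitching(infosets, parent, search, root_memory):
    stitched = {}                 # I -> distribution saved at I
    memory = {None: root_memory}  # snapshots handed down by ancestors
    for I in infosets:
        mem = copy.deepcopy(memory[parent(I)])  # ancestors only
        dist, mem_after = search(I, mem)
        stitched[I] = dist        # save the distribution at I
        memory[I] = mem_after     # passed down only to children of I
    return stitched
\end{verbatim}
}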
We also propose a multi-match \defword{aggregate method}.
This method first creates a global (accumulating) strategy data structure for each player type and generates a
set of matches $M$. Then, each $m \in M$ is simulated, invoking the appropriate search algorithm at each observed
$I$ along $m$.
Since $m$ is predetermined, the choice made by the search algorithm is discarded, but the information computed
(visit counts in ISMCTS, $s_I$ in OOS) is added to the global strategy data structure belonging to the player
who searched.
For a fair comparison, the first $I$ reached along each $m$ aggregates all the information gained in the search,
but for each subsequent $I'$ only the information collected at information sets $I''$ reachable from $I'$ is aggregated.
Note that it is safe to combine the information in this way: in ISMCTS the actions chosen and visits are independent of
how $I'$ was reached, and in OOS the accumulating $s_I$ values of two converging $\epsilon$-equilibrium average
strategies can be combined due to the linearity of expected utility.
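The following minimal Python sketch illustrates the aggregate method; \texttt{observed\_infosets}, \texttt{reachable}, and the format returned by \texttt{search} are hypothetical placeholders, not parts of our implementation:
{\small
\begin{verbatim}
import collections

# Minimal sketch of the aggregate method. `observed_infosets(m)`
# yields the searching player's information sets along the
# predetermined match m, `search(I)` returns the information it
# accumulated (visit counts or s_I) per touched information set,
# and `reachable(I, J)` tests whether J is reachable from I.
def aggregate_strategy(matches, search,
                       observed_infosets, reachable):
    glob = collections.defaultdict(
        lambda: collections.defaultdict(float))
    for m in matches:
        first = True
        for I in observed_infosets(m):
            stats = search(I)  # recommended move is discarded
            for J, values in stats.items():
                # first search along m keeps everything; later
                # searches keep only sets reachable from I
                if first or reachable(I, J):
                    for a, w in values.items():
                        glob[J][a] += w
            first = False
    return glob
\end{verbatim}
}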
%a the \defword{partial stitching} method which does the same as full stitching to a fixed depth from the root, except also
%retains values from children information sets reached by the searches started along the frontier of the depth limit.
% Probably will ditch this single-match method when partial stitching is working.
%And the \defword{single-match} method: this simply runs several matches and saves all values from each $I$ searched (as well as all of its reachable
%children information sets) search, overwriting previously saved values as necessary.
In our experiments, ISMCTS uses $C = 2$; tuning showed that the value of $C$ does not
affect performance.
In II-Goofspiel(6) we run searches for 1, 2, and 4 seconds.
Since II-Goofspiel has no public actions, we compare only IST in this game.
Over a range of values for $\delta$ and all time settings, there was no statistically significant
winner in head-to-head matches between IST and ISMCTS.
Exploitability is shown in Figure~\ref{fig:results}.
For reference, the exploitability values of ISMCTS were 0.95, 0.96, and 0.91 for search times of
1, 2, and 4 seconds. We observe that exploitability generally
decreases as search time increases and changes little as $\delta$ increases, with its
lowest points at $\delta = 1$.
This result was surprising, and we suspect it is due to OOS benefiting from the reuse of
values from previous searches in the same match.
In LD(1,1), we try searches for 1, 5, and 25 seconds.
When played against a uniform random player, across all values of $\delta$ and all time settings,
IST wins 79.2-86.7\% and PST wins 78.6-84.0\%, with no noticeable differences across $\delta$ values.
ISMCTS beat uniform random 79.8-82.4\% at 1, 5, and 25 seconds.
Upon further inspection, ISMCTS very quickly converges to the same strategy every time: as first player,
with a weak roll ($r = \epsdice{1}, \epsdice{2},$ or \epsdice{3}) it bids 2-$r$ in the hope that the
opponent happens to have the same roll; if not, the second player would have a winning response most of the time.
On a roll of $r = \epsdice{4}$ or $r = \epsdice{5}$ it always bids 1-$r$ because this wins most of the time. Either way, as
second player ISMCTS selects responses based on the same reasoning, inferring the first player's roll from their
first bid. This also explains the relative exploitability values in Table~\ref{tbl:fullstitching}: the first player
plays overly safely and is hence hard to exploit, while the second player best-responds to an assumed pure first-player
strategy, which makes it highly exploitable.
Our exploitability experiments show a similarly skewed first- versus second-player effect in II-Goofspiel.
Results for OOS variants on LD(1,1) are shown in Figure~\ref{fig:results}. First, in head-to-head performance
OOS wins 55-60\% of games against ISMCTS; results for $\delta \in \{ 0.1, 0.8, 0.9 \}$ are fairly
consistent across variants and time settings, with more varied effects for $\delta > 0.9$.
Against the 0.001-equilibrium strategy,
$\delta \in \{0.5, 0.65\}$ seems the most robust across variants and time settings, again with varied effects when $\delta > 0.9$.
The exploitability of ISMCTS strategies computed by the aggregate method was 0.88, 0.82, and 0.79 at search times of
1, 5, and 25 seconds. There are some clear trends in the exploitability of OOS strategies.
First, there is an observed decrease in exploitability as search time increases, independent of $\delta$.
When $\delta$ is varied at a fixed time setting, the exploitability of the global strategy increases with $\delta$.
Moreover, when $\delta = 1$, every time setting for IST converges to the same highly exploitable point, illustrating that the
theoretical problems with full targeting manifest in practice.
Interestingly, this is not the case for PST, where $\delta = 1$ is actually always the best-case scenario.
In Liar's Dice this is sensible, because there is no reason to believe that any histories outside the public subgame
will affect the decisions made within the public subgame.
The sudden drop from $\delta = 0.99$ to $1$ could be caused by the fact that, as $\delta$ increases, the
variance of the importance-sampling corrections for the off-match samples grows; when $\delta = 1$ the
sampled counterfactual values may be biased, but they have much lower variance.
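To make this intuition concrete, write $q(z)$ (notation introduced here only for illustration) for the probability that the untargeted scheme samples an off-match terminal history $z$. Such a history can only be produced by the untargeted branch, so it is sampled with probability $(1-\delta)\,q(z)$, and its importance-sampling correction therefore contains a factor
\[
  \frac{1}{(1-\delta)\,q(z)},
\]
which grows without bound as $\delta \to 1$; at $\delta = 1$ these histories are never sampled, removing this source of variance at the cost of bias.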
PST is less affected by the targeting than IST and appears to have lower exploitability.
For each method and all values of $\delta$, increasing the search time decreases exploitability.
The exploitability values in Table~\ref{tbl:fullstitching} also show this trend.
We performed one head-to-head experiment consisting of 500 games of PST ($\delta = 0.8$) vs. ISMCTS in LD(2,2),
with 25 seconds of search time per move.
PST won 256 games, showing competitive promise in
this much larger game with roughly 352 million information sets.
%Finally, inspired by~\cite{Auger11Multiple}, we started an initial investigation of OOS and ISMCTS in Phantom Tic-Tac-Toe.
%We noticed that the mixed distribution obtained by normalizing the visit counts of ISMCTS was producing a
%similar distribution as OOS on the second-player's first action. As in the Rock, Paper, Scissors
%example~\cite{Shafiei09}, we designed a biased version that gave higher payoff if the final line
%included a side. This modification led to OOS and ISMCTS producing two distinctly different distributions at the second
%player's first action.
\begin{table}
{\small
\begin{center}
\begin{tabular}{ccccc}
Algorithm & Time & $\epsilon_1$ & $\epsilon_2$ & $\epsilon_\sigma$ \\
\hline
ISMCTS & 1s & 0.235 & 0.574 & 0.808 \\
IST & 1s & 0.337 & 0.311 & 0.648 \\
PST & 1s & 0.203 & 0.211 & 0.414 \\
\hline
ISMCTS & 5s & 0.251 & 0.548 & 0.799 \\
IST & 5s & 0.229 & 0.295 & 0.524 \\
PST & 5s & 0.148 & 0.125 & 0.273 \\
\hline
\end{tabular}
\caption{LD(1,1) exploitability using full stitching, $\delta = 0.9$.}
\label{tbl:fullstitching}
\end{center}
}
\end{table}
\section{Conclusion}
In the paper, we have introduced Online Outcome Sampling, the first Monte Carlo
Tree Search algorithm that is guaranteed to produce
approximate equilibrium strategies as search time per move is increased in imperfect information games.
We showed that in head-to-head performance, OOS is able to compete with
ISMCTS while producing strategies with lower exploitability
at the same search time in Liar's Dice and II-Goofspiel.
We propose two methods for targeting relevant parts of the game based on the current match history,
IST and PST.
In II-Goofspiel, results are only slightly affected by different targeting probabilities, whereas the
effect is stronger in LD(1,1), with PST seeming better overall.
In future work, we hope to investigate more and larger games such as
no-limit poker, and the effect of informed playout policies.
Ultimately, we want to make practical comparisons to other
baseline algorithms such as
PIMC~\cite{Long10Understanding}, MMCTS~\cite{Auger11Multiple}, and IIMC~\cite{Furtak13Recursive}.
One interesting question is whether the subgame decomposition ideas of~\cite{Burch14Solving} could be
adapted to the online search setting.
Finally, using MCRNR~\cite{Ponsen11Computing} as the base algorithm could provide a balance between
exploitability and exploitation against known opponents.
\vspace{0.2cm}
{\small \noindent {\bf Acknowledgments.} This work is partially funded by the Netherlands Organisation for Scientific
Research (NWO) in the framework of the project Go4Nature, grant number 612.000.938 and the Czech
Science Foundation, grant no. P202/12/2054.}
\bibliographystyle{aaai}
\bibliography{iioos}
\end{document}