Mastering the game of Go with deep neural networks and tree search


Authors: David Silver1*, Aja Huang1*, Chris J. Maddison1, Arthur Guez1, Laurent Sifre1, George van den Driessche1, Julian Schrittwieser1, Ioannis Antonoglou1, Veda Panneershelvam1, Marc Lanctot1, Sander Dieleman1, Dominik Grewe1, John Nham2, Nal Kalchbrenner1, Ilya Sutskever2, Timothy Lillicrap1, Madeleine Leach1, Koray Kavukcuoglu1, Thore Graepel1 & Demis Hassabis1




Abstract: The game of Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves. Here we introduce a new approach to computer Go that uses ‘value networks’ to evaluate board positions and ‘policy networks’ to select moves. These deep neural networks are trained by a novel combination of supervised learning from human expert games, and reinforcement learning from games of self-play. Without any lookahead search, the neural networks play Go at the level of state-of-the-art Monte Carlo tree search programs that simulate thousands of random games of self-play. We also introduce a new search algorithm that combines Monte Carlo simulation with value and policy networks. Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0. This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.

(Translator's note 1: The paper has 15 parts: 0 Abstract; 1 Introduction; 2 Supervised learning of policy networks; 3 Reinforcement learning of policy networks; 4 Reinforcement learning of value networks; 5 Searching with policy and value networks; 6 Evaluating the playing strength of AlphaGo; 7 Discussion (References 1–38); 8 Methods; 9 References 39–62; 10 Acknowledgements; 11 Author Information (Author Contributions); 12 Extended data (Extended Data figures and tables); 13 Supplementary information (Rights and permissions, About this article, Further reading); and 14 Comments. Of the 62 references, 38 are cited in the main text and Discussion and 24 in the Methods. The Nature online materials also include the paper's PDF, 6 large figures, 6 game-record PowerPoint slides, and 1 supplementary-information archive. This edition omits Part 8 and some other material.)








Figure 1: Neural network training pipeline and architecture. a. A fast rollout policy pπ and supervised learning (SL) policy network are trained to predict human expert moves in a data-set of positions. A reinforcement learning (RL) policy network is initialised to the SL policy network, and is then improved by policy gradient learning to maximize the outcome (i.e. winning more games) against previous versions of the policy network. A new data-set is generated by playing games of self-play with the RL policy network. Finally, a value network is trained by regression to predict the expected outcome (i.e. whether the current player wins) in positions from the self-play data-set. b. Schematic representation of the neural network architecture used in AlphaGo. The policy network takes a representation of the board position s as its input, passes it through many convolutional layers with parameters σ (SL policy network) or ρ (RL policy network), and outputs a probability distribution pσ(a|s) or pρ(a|s) over legal moves a, represented by a probability map over the board. The value network similarly uses many convolutional layers with parameters θ, but outputs a scalar value vθ(s′) that predicts the expected outcome in position s′.

All games of perfect information have an optimal value function, v*(s), which determines the outcome of the game, from every board position or state s, under perfect play by all players. These games may be solved by recursively computing the optimal value function in a search tree containing approximately b^d possible sequences of moves, where b is the game’s breadth (number of legal moves per position) and d is its depth (game length). In large games, such as chess (b≈35; d≈80) [1] and especially Go (b≈250; d≈150) [1], exhaustive search is infeasible [2, 3], but the effective search space can be reduced by two general principles. First, the depth of the search may be reduced by position evaluation: truncating the search tree at state s and replacing the subtree below s by an approximate value function v(s)≈v*(s) that predicts the outcome from state s. This approach has led to super-human performance in chess [4], checkers [5] and othello [6], but it was believed to be intractable in Go due to the complexity of the game [7]. Second, the breadth of the search may be reduced by sampling actions from a policy p(a|s) that is a probability distribution over possible moves a in position s. For example, Monte-Carlo rollouts [8] search to maximum depth without branching at all, by sampling long sequences of actions for both players from a policy p. Averaging over such rollouts can provide an effective position evaluation, achieving super-human performance in backgammon [8] and Scrabble [9], and weak amateur level play in Go [10].
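To get a feel for why exhaustive search is infeasible, the b^d estimate can be evaluated on a log scale. The following is a minimal sketch: the function name `log10_tree_size` is our own, and the "reduced" breadth and depth in the last call are arbitrary illustrative numbers, not values from the paper.

```python
import math


def log10_tree_size(breadth: int, depth: int) -> float:
    """Return log10(breadth ** depth), the order of magnitude of the
    number of move sequences in a full search tree."""
    return depth * math.log10(breadth)


# Breadth and depth figures quoted in the text.
chess = log10_tree_size(35, 80)
go = log10_tree_size(250, 150)

# Hypothetical reduced search: a policy keeps 10 candidate moves per
# position and a value function truncates the search after 20 moves.
reduced = log10_tree_size(10, 20)

print(f"chess:   about 10^{chess:.1f} sequences")
print(f"Go:      about 10^{go:.1f} sequences")
print(f"reduced: about 10^{reduced:.1f} sequences")
```

With the quoted figures, the chess tree holds roughly 10^123.5 sequences and the Go tree roughly 10^359.7. Because breadth is the base and depth is the exponent, shrinking either one cuts the count by many orders of magnitude, which is the point of the two principles described above.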

Monte-Carlo tree search (MCTS) [11, 12] uses Monte-Carlo rollouts to estimate the value of each state in a search tree. As more simulations are executed, the search tree grows larger and the relevant values become more accurate. The policy used to select actions during search is also improved over time, by selecting children with higher values. Asymptotically, this policy converges to optimal play, and the evaluations converge to the optimal value function [12]. The strongest current Go programs are based on MCTS, enhanced by policies that are trained to predict human expert moves [13]. These policies are used to narrow the search to a beam of high probability actions, and to sample actions during rollouts. This approach has achieved strong amateur play [13–15]. However, prior work has been limited to shallow policies [13–15] or value functions [16] based on a linear combination of input features.
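The loop described above (select, expand, simulate, back up) can be sketched on a toy game. The example below is an illustration under stated assumptions, not AlphaGo's search: it uses the common UCT selection rule with uniformly random rollouts, and the game is single-pile Nim (remove 1 to 3 stones; whoever takes the last stone wins). All names (`Node`, `rollout`, `uct_child`, `mcts`) and the exploration constant are our own choices.

```python
import math
import random

MAX_TAKE = 3  # toy rule: remove 1..3 stones; taking the last stone wins


def legal_moves(stones):
    return list(range(1, min(MAX_TAKE, stones) + 1))


class Node:
    def __init__(self, stones, parent=None, move=None):
        self.stones = stones      # stones left for the player to move
        self.parent = parent
        self.move = move          # move that led from parent to this node
        self.children = []
        self.untried = legal_moves(stones)
        self.visits = 0
        self.wins = 0.0           # wins for the player who moved INTO this node


def rollout(stones, rng):
    """Play uniformly random moves to the end of the game.
    Return True if the player to move at the start of the rollout wins."""
    start_player_wins = False     # with 0 stones the player to move has lost
    start_player_to_move = True
    while stones > 0:
        stones -= rng.choice(legal_moves(stones))
        if stones == 0:
            start_player_wins = start_player_to_move
        start_player_to_move = not start_player_to_move
    return start_player_wins


def uct_child(node, c=1.4):
    """Pick the child maximising mean value plus an exploration bonus."""
    return max(
        node.children,
        key=lambda ch: ch.wins / ch.visits
        + c * math.sqrt(math.log(node.visits) / ch.visits),
    )


def mcts(root_stones, n_simulations, rng):
    """Return the most-visited move from a position with root_stones > 0."""
    root = Node(root_stones)
    for _ in range(n_simulations):
        node = root
        # 1. Selection: descend while the node is fully expanded.
        while not node.untried and node.children:
            node = uct_child(node)
        # 2. Expansion: add one new child if any move is untried.
        if node.untried:
            move = node.untried.pop()
            child = Node(node.stones - move, parent=node, move=move)
            node.children.append(child)
            node = child
        # 3. Simulation: random rollout from the new position.
        mover_won = not rollout(node.stones, rng)
        # 4. Backup: alternate the winner's perspective up the tree.
        while node is not None:
            node.visits += 1
            node.wins += 1.0 if mover_won else 0.0
            mover_won = not mover_won
            node = node.parent
    return max(root.children, key=lambda ch: ch.visits).move


best = mcts(5, 2000, random.Random(0))
print("stones to take from a pile of 5:", best)
```

In this Nim variant the winning strategy is to leave the opponent a multiple of 4 stones, so from a pile of 5 the search should settle on taking 1. As simulations accumulate, visits concentrate on that move, which is the "selecting children with higher values" behaviour described in the text.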

Recently, deep convolutional neural networks have achieved unprecedented performance in visual domains: for example image classification [17], face recognition [18], and playing Atari games [19]. They use many layers of neurons, each arranged in overlapping tiles, to construct increasingly abstract, localised representations of an image [20]. We employ a similar architecture for the game of Go. We pass in the board position as a 19×19 image and use convolutional layers to construct a representation of the position. We use these neural networks to reduce the effective depth and breadth of the search tree: evaluating positions using a value network, and sampling actions using a policy network.
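To make "the board position as a 19×19 image" concrete, here is a minimal sketch of one convolutional filter sliding over a board. It rests on assumptions not taken from the text: the board is encoded as a single plane (1 for a stone, 0 for empty), the 3×3 filter is hand-written rather than learned, and the names `conv2d_same` and `NEIGHBOUR_KERNEL` are hypothetical. A real network would stack many learned filters over many layers.

```python
BOARD_SIZE = 19

# Hand-written 3x3 filter that counts stones on the four adjacent points.
NEIGHBOUR_KERNEL = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]


def conv2d_same(board, kernel):
    """Slide a square kernel over the board with zero padding, then apply
    ReLU. The output has the same height and width as the input.
    (As in most deep learning libraries, this is a cross-correlation:
    the kernel is not flipped.)"""
    k = len(kernel)
    pad = k // 2
    n = len(board)
    out = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            total = 0.0
            for dr in range(k):
                for dc in range(k):
                    rr, cc = r + dr - pad, c + dc - pad
                    # Points outside the board contribute zero.
                    if 0 <= rr < n and 0 <= cc < n:
                        total += kernel[dr][dc] * board[rr][cc]
            out[r][c] = max(0.0, total)  # ReLU non-linearity
    return out


# Empty board with two stones on the same row, one point apart.
board = [[0.0] * BOARD_SIZE for _ in range(BOARD_SIZE)]
board[3][3] = 1.0
board[3][5] = 1.0

features = conv2d_same(board, NEIGHBOUR_KERNEL)
print("response at the point between the stones:", features[3][4])
```

Each output value depends only on a small overlapping tile of the input, so the feature map is a localised description of the position. The point between the two stones responds most strongly because both stones fall inside its tile.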

We train the neural networks using a pipeline consisting of several stages of machine learning (Figure 1). We begin by training a supervised learning (SL) policy network, pσ, directly from expert human moves. This provides fast, efficient learning updates with immediate feedback and high quality gradients. Similar to prior work [13, 15], we also train a fast policy pπ that can rapidly sample actions during rollouts. Next, we train a reinforcement learning (RL) policy network, pρ, that improves the SL policy network by optimising the final outcome of games of self-play. This adjusts the policy towards the correct goal of winning games, rather than maximizing predictive accuracy. Finally, we train a value network vθ that predicts the winner of games played by the RL policy net
