Reinforcement Learning
Peter Bodík, cs294-34

Previous Lectures
- Supervised learning: classification, regression
- Unsupervised learning: clustering, dimensionality reduction
- Reinforcement learning: a generalization of supervised learning; learn from interaction with an environment to achieve a goal
- [diagram: the agent sends an action to the environment and receives a reward and a new state]

Today
- examples
- defining a Markov Decision Process
- solving an MDP using Dynamic Programming
- Reinforcement Learning: Monte Carlo methods, Temporal-Difference learning
- miscellaneous: state representation, function approximation, rewards

Robot in a room
- actions: UP, DOWN, LEFT, RIGHT
- UP: 80% move UP, 10% move LEFT, 10% move RIGHT
- reward +1 at (4,3), -1 at (4,2); reward -0.04 for each step
- what's the strategy to achieve maximum reward?
- what if the actions were deterministic?

Other examples
- pole-balancing
- walking robot (applet)
- TD-Gammon (Gerry Tesauro)
- helicopter (Andrew Ng)
- no teacher who would say "good" or "bad"; is a reward of "10" good or bad? rewards could be delayed
- explore the environment and learn from the experience; not just blind search, try to be smart about it

Outline
- examples
- defining a Markov Decision Process
- solving an MDP using Dynamic Programming
- Reinforcement Learning: Monte Carlo methods, Temporal-Difference learning
- miscellaneous: state representation, function approximation, rewards

Robot in a room
- actions: UP, DOWN, LEFT, RIGHT
- UP: 80% move UP, 10% move LEFT, 10% move RIGHT
- reward +1 at (4,3), -1 at (4,2); reward -0.04 for each step
- states, actions, rewards: what is the solution?

Is this a solution?
- only if the actions were deterministic; not in this case (actions are stochastic)
- a solution/policy is a mapping from each state to an action

Optimal policy
- [figures: optimal policies for per-step rewards of -2, -0.1, -0.04, -0.01, +0.01]

Markov Decision Process (MDP)
- set of states S, set of actions A, initial state s0
- transition model P(s' | s, a); e.g. P((1,2) | (1,1), UP) = 0.8
- Markov assumption
- reward function r(s); e.g. r((4,3)) = +1
- goal: maximize cumulative reward in the long run
- policy: mapping from S to A; π(s) or π(s,a)
- reinforcement learning: transitions and rewards usually not available; how to change the policy based on experience; how to explore the environment
- [diagram: agent and environment exchanging action, reward, new state]

Computing return from rewards
- episodic (vs. continuing) tasks: "game over" after N steps; the optimal policy depends on N; harder to analyze
- additive rewards: V(s0, s1, ...) = r(s0) + r(s1) + r(s2) + ...; infinite value for continuing tasks
- discounted rewards: V(s0, s1, ...) = r(s0) + γ·r(s1) + γ²·r(s2) + ...; value bounded if rewards are bounded
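The discounted-return formula above is easy to sanity-check in code. A minimal sketch, assuming rewards are given as a list indexed by time step; the function name and the sample rewards are made up for illustration:

```python
# Discounted return from the slide: V(s0, s1, ...) = r(s0) + gamma*r(s1) + gamma^2*r(s2) + ...
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards, each weighted by gamma**t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: three -0.04 step penalties followed by the +1 terminal reward
# from the robot-in-a-room task.
print(discounted_return([-0.04, -0.04, -0.04, 1.0], gamma=0.9))
```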

Value functions
- state value function V^π(s): expected return when starting in s and following π
- state-action value function Q^π(s,a): expected return when starting in s, performing a, and following π
- useful for finding the optimal policy: can be estimated from experience; pick the best action using Q^π(s,a)
- Bellman equation (written out below) [diagram: backup over s, a, s', r]

Optimal value functions
- there is a set of optimal policies; V^π defines a partial ordering on policies; the optimal policies share the same optimal value function
- Bellman optimality equation: a system of n non-linear equations; solve for V*(s); easy to extract the optimal policy
- having Q*(s,a) makes it even simpler
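The two slides above refer to the Bellman equation and the Bellman optimality equation without writing them out. For reference, the standard forms matching the slides' r(s), P(s'|s,a), γ notation (as in Sutton & Barto) are:

```latex
% Bellman expectation and optimality equations (standard forms; notation assumed)
V^{\pi}(s) = r(s) + \gamma \sum_{a} \pi(s,a) \sum_{s'} P(s' \mid s,a)\, V^{\pi}(s')
V^{*}(s)   = r(s) + \gamma \max_{a} \sum_{s'} P(s' \mid s,a)\, V^{*}(s')
Q^{*}(s,a) = r(s) + \gamma \sum_{s'} P(s' \mid s,a)\, \max_{a'} Q^{*}(s',a')
```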

Outline
- examples
- defining a Markov Decision Process
- solving an MDP using Dynamic Programming
- Reinforcement Learning: Monte Carlo methods, Temporal-Difference learning
- miscellaneous: state representation, function approximation, rewards

Dynamic programming
- main idea: use value functions to structure the search for good policies
- needs a perfect model of the environment
- two main components: policy evaluation (compute V^π from π) and policy improvement (improve π based on V^π)
- start with an arbitrary policy; repeat evaluation/improvement until convergence

Policy evaluation/improvement
- policy evaluation: π → V^π; the Bellman equations define a system of n equations; could solve it directly, but we will use the iterative version; start with an arbitrary value function V0 and iterate until Vk converges
- policy improvement: V^π → π'; π' is either strictly better than π, or π' is optimal (if π' = π)

Policy/Value iteration
- Policy iteration: two nested iterations; too slow; don't need to converge to V^πk, just move towards it
- Value iteration: use the Bellman optimality equation as an update; converges to V*
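A rough value-iteration sketch for the robot-in-a-room task described earlier, assuming the usual 4x3 grid with a wall at (2,2), the 80/10/10 action noise and the -0.04 step reward; the wall position, the undiscounted setting, and the stopping threshold are assumptions for illustration, not details from the deck.

```python
# Value iteration on an assumed 4x3 robot-in-a-room grid (illustrative sketch).
GAMMA = 1.0            # treat the grid task as undiscounted and episodic (assumption)
STEP_REWARD = -0.04
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}
WALL = (2, 2)          # assumed wall position
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != WALL]

MOVES = {'UP': (0, 1), 'DOWN': (0, -1), 'LEFT': (-1, 0), 'RIGHT': (1, 0)}
# Perpendicular slips: e.g. UP moves up with prob 0.8, left/right with prob 0.1 each.
SLIPS = {'UP': ('LEFT', 'RIGHT'), 'DOWN': ('LEFT', 'RIGHT'),
         'LEFT': ('UP', 'DOWN'), 'RIGHT': ('UP', 'DOWN')}

def step(s, direction):
    """Deterministic move; bumping into the wall or border leaves the state unchanged."""
    nxt = (s[0] + MOVES[direction][0], s[1] + MOVES[direction][1])
    return nxt if nxt in STATES else s

def transitions(s, a):
    """(probability, next_state) pairs under the 80/10/10 noise model."""
    slip_l, slip_r = SLIPS[a]
    return [(0.8, step(s, a)), (0.1, step(s, slip_l)), (0.1, step(s, slip_r))]

def value_iteration(theta=1e-6):
    """Repeatedly apply the Bellman optimality update until values stop changing."""
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            if s in TERMINALS:
                v_new = TERMINALS[s]
            else:
                v_new = STEP_REWARD + GAMMA * max(
                    sum(p * V[s2] for p, s2 in transitions(s, a)) for a in MOVES)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

if __name__ == '__main__':
    V = value_iteration()
    # Greedy policy extracted from the converged values.
    policy = {s: max(MOVES, key=lambda a: sum(p * V[s2] for p, s2 in transitions(s, a)))
              for s in STATES if s not in TERMINALS}
    print(V[(1, 1)], policy[(1, 1)])
```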

Using DP
- needs a complete model of the environment and rewards: for the robot in a room, the state space, action space, and transition model
- can we use DP to solve the robot in a room? backgammon? a helicopter?
- DP bootstraps: it updates estimates on the basis of other estimates

Outline
- examples
- defining a Markov Decision Process
- solving an MDP using Dynamic Programming
- Reinforcement Learning: Monte Carlo methods, Temporal-Difference learning
- miscellaneous: state representation, function approximation, rewards

Monte Carlo methods
- don't need full knowledge of the environment: just experience, or simulated experience
- average sample returns; defined only for episodic tasks
- but similar to DP: policy evaluation, policy improvement

Monte Carlo policy evaluation
- want to estimate V^π(s) = expected return starting from s and following π
- estimate it as the average of observed returns in state s
- first-visit MC: average the returns following the first visit to state s
- [figure: an episode starting at s0 and passing through s, with rewards +1, -2, 0, +1, -3, +5; R1(s) = +2]
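A small first-visit Monte Carlo evaluation sketch, assuming episodes are supplied as lists of (state, reward) pairs; the input format and the sample episode are illustrative, not the slide's example.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """Average the return observed after the first visit to each state."""
    returns = defaultdict(list)
    for episode in episodes:                       # episode = [(state, reward), ...]
        G, G_after = 0.0, [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):  # returns computed backwards
            G = episode[t][1] + gamma * G
            G_after[t] = G
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:                      # first visit only
                seen.add(s)
                returns[s].append(G_after[t])
    return {s: sum(g) / len(g) for s, g in returns.items()}

# A made-up episode; the return after the first visit to 's' is -2+0+1-3+5 = +1.
print(first_visit_mc([[('s0', +1), ('s', -2), ('s1', 0), ('s', +1), ('s2', -3), ('s3', +5)]]))
```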

Monte Carlo control
- V^π is not enough for policy improvement: would need an exact model of the environment
- estimate Q^π(s,a) instead
- MC control: update after each episode
- non-stationary environment
- a problem: a greedy policy won't explore all actions

Maintaining exploration
- a key ingredient of RL
- a deterministic/greedy policy won't explore all actions: we don't know anything about the environment at the beginning and need to try all actions to find the optimal one
- maintain exploration: use soft policies instead: π(s,a) > 0 for all (s,a)
- ε-greedy policy (sketched below): with probability 1-ε perform the optimal/greedy action, with probability ε perform a random action
- will keep exploring the environment; slowly move it towards the greedy policy: ε → 0
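A minimal ε-greedy selection sketch, assuming Q is a dict keyed by (state, action) and `actions` is a list; the names are illustrative.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)                           # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))   # exploit

# Decaying epsilon toward 0 slowly shifts this soft policy toward the greedy policy.
```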

Simulated experience
- 5-card draw poker: s0: A, A, 6, A, 2; a0: discard the 6 and the 2; s1: A, A, A, A, 9 + dealer takes 4 cards; return: +1 (probably)
- DP: list all states and actions, compute P(s, a, s'): P(A,A,6,A,2; discard 6,2; A,9,4) = 0.00192
- MC: all you need are sample episodes; let MC play against a random policy, or itself, or another algorithm

Summary of Monte Carlo
- don't need a model of the environment: average sample returns; only for episodic tasks
- learn from sample episodes or simulated experience
- can concentrate on "important" states; don't need a full sweep
- no bootstrapping: less harmed by violation of the Markov property
- need to maintain exploration: use soft policies

Outline
- examples
- defining a Markov Decision Process
- solving an MDP using Dynamic Programming
- Reinforcement Learning: Monte Carlo methods, Temporal-Difference learning
- miscellaneous: state representation, function approximation, rewards

Temporal Difference Learning
- combines ideas from MC and DP: like MC, learn directly from experience (no model needed); like DP, bootstrap
- works for continuing tasks; usually faster than MC
- constant-α MC: has to wait until the end of the episode to update
- simplest TD: update after every step, based on the successor
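The "update after every step, based on the successor" rule above is the tabular TD(0) update. A minimal sketch, assuming a dict-based value table and a constant step size α (both assumptions, not notation from the slides):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0):
    """V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)], applied after every step."""
    v_s = V.get(s, 0.0)
    target = r + gamma * V.get(s_next, 0.0)   # bootstrap from the successor state
    V[s] = v_s + alpha * (target - v_s)
    return V
```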

MC vs. TD
- observed the following 8 episodes: A→0, B→0; B→1; B→1; B→1; B→1; B→1; B→1; B→0
- MC and TD agree on V(B) = 3/4
- MC: V(A) = 0; converges to the values that minimize the error on the training data
- TD: V(A) = 3/4; converges to the maximum-likelihood estimate of the Markov process

Sarsa
- again, need Q(s,a), not just V(s)
- control: start with a random policy; update Q and π after each step
- again, need ε-soft policies
- [figure: backup diagram with r_t, r_t+1]

Q-learning
- the previous algorithms are on-policy: start with a random policy, iteratively improve it, converge to the optimal policy
- Q-learning is off-policy: use any policy to estimate Q
- Q directly approximates Q* (Bellman optimality equation), independent of the policy being followed
- only requirement: keep updating each (s,a) pair
- [figure: comparison with Sarsa]
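Rough sketches of the two updates discussed above, assuming a dict-based Q table and constant α, γ; Sarsa's target uses the action actually taken next (on-policy), while Q-learning's target maximizes over actions (off-policy).

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=1.0):
    """On-policy target: the action a_next chosen by the current (soft) policy."""
    q_sa = Q.get((s, a), 0.0)
    target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = q_sa + alpha * (target - q_sa)

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=1.0):
    """Off-policy target: the best action in s_next, whatever the behaviour policy did."""
    q_sa = Q.get((s, a), 0.0)
    target = r + gamma * max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = q_sa + alpha * (target - q_sa)
```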

Outline
- examples
- defining a Markov Decision Process
- solving an MDP using Dynamic Programming
- Reinforcement Learning: Monte Carlo methods, Temporal-Difference learning
- miscellaneous: state representation, function approximation, rewards

State representation
- pole-balancing: move the car left/right to keep the pole balanced
- state representation: position and velocity of the car; angle and angular velocity of the pole
- what about the Markov property? would need more information: noise in the sensors, temperature, bending of the pole
- solution: coarse discretization of the 4 state variables: left, center, right
- totally non-Markov, but still works

Function approximation
- until now, the state space was small and discrete
- represent V_t as a parameterized function: linear regression, decision tree, neural net, ...
- linear regression: update the parameters instead of entries in a table; better generalization: fewer parameters, and updates affect "similar" states as well
- TD update: treat it as one data point for regression (a linear sketch follows the slide); want a method that can learn on-line (update after each step)
- [figure: regression of y against x]
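A sketch of the TD update with a linear value function V(s) = w·φ(s), as described above: the parameters are adjusted instead of table entries. The feature map `phi` and the step size are assumptions for illustration.

```python
import numpy as np

def linear_td0_update(w, phi, s, r, s_next, alpha=0.01, gamma=1.0):
    """One semi-gradient TD(0) step for the linear value function V(s) = w . phi(s)."""
    x, x_next = phi(s), phi(s_next)
    td_error = r + gamma * np.dot(w, x_next) - np.dot(w, x)
    return w + alpha * td_error * x   # move the weights, not a table entry
```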

Features
- tile coding, coarse coding: binary features
- radial basis functions: typically a Gaussian, between 0 and 1
- [figures from Sutton & Barto, Reinforcement Learning]
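A tiny radial-basis-function feature sketch for a single state dimension; the centres and width are made-up values for illustration.

```python
import numpy as np

def rbf_features(x, centers=np.linspace(0.0, 1.0, 8), width=0.1):
    """One Gaussian bump per centre; each feature lies between 0 and 1."""
    return np.exp(-((x - centers) ** 2) / (2 * width ** 2))

print(rbf_features(0.42))
```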

Splitting and aggregation
- want to discretize the state space and learn the best discretization during training
- splitting of the state space: start with a single state; split a state when different parts of it have different values
- state aggregation: start with many states; merge states with similar values

Designing rewards
- robot in a maze: episodic task, not discounted; +1 when out, 0 for each step
- chess: GOOD: +1 for winning, -1 for losing; BAD: +0.25 for taking the opponent's pieces (high reward even when you lose)
- rewards indicate what we want to accomplish, NOT how we want to accomplish it
- shaping: the positive reward is often very "far away"; add rewards for achieving subgoals (domain knowledge); also: adjust the initial policy or initial value function

Case study: Backgammon
- rules: 30 pieces, 24 locations; roll 2, 5: move 2, 5; hitting, blocking; branching factor: 400
- implementation: use TD(λ) and neural nets; 4 binary features for each position on the board (number of white pieces); no backgammon expert knowledge
- results:
  - TD-Gammon 0.0: trained against itself (300,000 games); as good as the best previous backgammon computer program (also by Tesauro), which used a lot of expert input and hand-crafted features
  - TD-Gammon 1.0: add special features
  - TD-Gammon 2 and 3 (2-ply and 3-ply search): 1.5M games; beat the human champion

Summary
- Reinforcement learning: use it when you need to make decisions in an uncertain environment and actions have delayed effects
- solution methods: dynamic programming (needs a complete model), Monte Carlo, temporal-difference learning (Sarsa, Q-learning)
- the algorithms are simple; most of the work is in designing features, the state representation, and the rewards
- www.cs.ualberta.ca/sutton/book/the-book.html
