Summary

  • Iterative policy evaluation is an algorithm used in the dynamic programming setting to estimate the state-value function $v_\pi$ corresponding to a policy $\pi$. In this approach, a Bellman update is applied to the value function estimate until the changes to the estimate are nearly imperceptible (a minimal sketch in code appears after this list).

  • Estimation of Action Values: In the dynamic programming setting, it is possible to quickly obtain the action-value function $q_\pi$ from the state-value function $v_\pi$ with a one-step lookahead: $q_\pi(s,a) = \sum_{s',r} p(s',r \mid s,a)\bigl(r + \gamma v_\pi(s')\bigr)$ (see the lookahead sketch after this list).

  • Policy improvement takes a state-value function estimate $V$ (approximating $v_\pi$ for a policy $\pi$), and returns an improved (or equivalent) policy $\pi'$, where $\pi' \geq \pi$. The algorithm first constructs the action-value function estimate $Q(s,a) = \sum_{s',r} p(s',r \mid s,a)\bigl(r + \gamma V(s')\bigr)$. Then, for each state $s \in \mathcal{S}$, you need only select an action that maximizes $Q(s,\cdot)$. In other words, $\pi'(s) = \arg\max_{a \in \mathcal{A}(s)} Q(s,a)$ for all $s \in \mathcal{S}$ (see the improvement sketch after this list).

  • Policy iteration is an algorithm that can solve an MDP in the dynamic programming setting. It proceeds as a sequence of policy evaluation and improvement steps, and is guaranteed to converge to the optimal policy (for an arbitrary finite MDP); see the policy iteration sketch after this list.

  • Truncated policy iteration is a variant of policy iteration used in the dynamic programming setting to solve an MDP. In this approach, the evaluation step is stopped after a fixed number of sweeps through the state space rather than run to convergence. We refer to the algorithm used in the evaluation step as truncated policy evaluation (see the truncated variant sketch after this list).

  • Value iteration is an algorithm used in the dynamic programming setting to estimate the optimal state-value function $v_*$ (and, from it, an optimal policy). In this approach, each sweep over the state space simultaneously performs policy evaluation and policy improvement (see the value iteration sketch after this list).
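
The sketches below illustrate the algorithms summarized above in Python. They all assume a Gym-style tabular MDP in which `P[s][a]` is a list of `(prob, next_state, reward, done)` tuples and a policy is an array of action probabilities with shape `(num_states, num_actions)`; every function name and parameter (`policy_evaluation`, `theta`, and so on) is an illustrative choice, not something fixed by the text. First, a minimal sketch of iterative policy evaluation:

```python
import numpy as np

def policy_evaluation(P, num_states, policy, gamma=1.0, theta=1e-8):
    """Apply the Bellman expectation update repeatedly until no state's value
    changes by more than `theta` within a full sweep."""
    V = np.zeros(num_states)
    while True:
        delta = 0.0
        for s in range(num_states):
            v_new = 0.0
            for a, action_prob in enumerate(policy[s]):
                for prob, next_s, reward, done in P[s][a]:
                    # Expected one-step return, bootstrapping from the current estimate V.
                    # Terminal transitions contribute no bootstrapped value.
                    v_new += action_prob * prob * (reward + gamma * V[next_s] * (not done))
            delta = max(delta, abs(V[s] - v_new))
            V[s] = v_new
        if delta < theta:
            return V
```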
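
Next, a sketch of the one-step lookahead that recovers action values from a state-value estimate, together with policy improvement built on top of it (same assumptions as above; `q_from_v` and `policy_improvement` are illustrative names, and the block reuses the `numpy` import from the previous sketch):

```python
def q_from_v(P, num_actions, V, s, gamma=1.0):
    """One-step lookahead: build Q(s, .) from a state-value estimate V."""
    q = np.zeros(num_actions)
    for a in range(num_actions):
        for prob, next_s, reward, done in P[s][a]:
            q[a] += prob * (reward + gamma * V[next_s] * (not done))
    return q

def policy_improvement(P, num_states, num_actions, V, gamma=1.0):
    """Return a deterministic (one-hot) policy that is greedy with respect to V."""
    policy = np.zeros((num_states, num_actions))
    for s in range(num_states):
        q = q_from_v(P, num_actions, V, s, gamma)
        policy[s][np.argmax(q)] = 1.0  # pi'(s) = argmax_a Q(s, a)
    return policy
```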
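
Policy iteration can then be sketched as an alternation of the two routines above; the equiprobable random starting policy and the "policy no longer changes" stopping test are common choices rather than requirements:

```python
def policy_iteration(P, num_states, num_actions, gamma=1.0, theta=1e-8):
    """Alternate full policy evaluation and greedy improvement until the policy is stable."""
    policy = np.ones((num_states, num_actions)) / num_actions  # equiprobable random start
    while True:
        V = policy_evaluation(P, num_states, policy, gamma, theta)
        new_policy = policy_improvement(P, num_states, num_actions, V, gamma)
        if np.array_equal(new_policy, policy):  # stable policy => optimal for a finite MDP
            return new_policy, V
        policy = new_policy
```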
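
A truncated variant might look as follows; `max_sweeps` is an illustrative parameter, and stopping the outer loop when successive value estimates barely change is one reasonable criterion among several:

```python
def truncated_policy_evaluation(P, num_states, policy, V, max_sweeps, gamma=1.0):
    """Same update as policy_evaluation, but stop after a fixed number of sweeps."""
    for _ in range(max_sweeps):
        for s in range(num_states):
            v_new = 0.0
            for a, action_prob in enumerate(policy[s]):
                for prob, next_s, reward, done in P[s][a]:
                    v_new += action_prob * prob * (reward + gamma * V[next_s] * (not done))
            V[s] = v_new
    return V

def truncated_policy_iteration(P, num_states, num_actions, max_sweeps=5, gamma=1.0, theta=1e-8):
    """Policy iteration whose evaluation step runs only `max_sweeps` sweeps."""
    V = np.zeros(num_states)
    while True:
        policy = policy_improvement(P, num_states, num_actions, V, gamma)
        V_old = V.copy()
        V = truncated_policy_evaluation(P, num_states, policy, V, max_sweeps, gamma)
        if np.max(np.abs(V - V_old)) < theta:  # value estimate has (nearly) stopped changing
            return policy, V
```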
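
Finally, a sketch of value iteration, where each sweep applies the Bellman optimality update (a max over the one-step lookahead) and a greedy policy is extracted once the values have converged:

```python
def value_iteration(P, num_states, num_actions, gamma=1.0, theta=1e-8):
    """Each sweep folds evaluation and (implicit) improvement into one update."""
    V = np.zeros(num_states)
    while True:
        delta = 0.0
        for s in range(num_states):
            v_new = np.max(q_from_v(P, num_actions, V, s, gamma))  # Bellman optimality update
            delta = max(delta, abs(V[s] - v_new))
            V[s] = v_new
        if delta < theta:
            break
    # Recover a greedy (optimal) policy from the converged value estimate.
    return policy_improvement(P, num_states, num_actions, V, gamma), V
```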