Accession Number : ADA321555

Title :   A Response to Bertsekas' "A Counterexample to Temporal-Differences Learning."

Descriptive Note : Final rept. 21 Mar-22 May 96

Corporate Author : WRIGHT LAB WRIGHT-PATTERSON AFB OH

Personal Author(s) : Harmon, Mance E. ; Baird, Leemon C.


Report Date : 22 NOV 1996

Pagination or Media Count : 10

Abstract : For an absorbing Markov chain with a reinforcement on each transition, Bertsekas (1995a) gives a simple example in which the function learned by TD(lambda) depends on lambda. Bertsekas showed that for lambda=1 the approximation is optimal with respect to a least-squares error of the value function, and that for lambda=0 the approximation obtained by the TD method is poor with respect to the same metric. With respect to the error in the values, TD(1) approximates the function better than TD(0). However, with respect to the error in the differences between values, TD(0) approximates the function better than TD(1); that is, TD(1) is better than TD(0) only under the former metric, not the latter. In addition, direct TD(lambda) weights the errors unequally, while residual gradient methods (Baird, 1995; Harmon, Baird, & Klopf, 1995) weight the errors equally. For the case of control, a simple Markov decision process is presented for which direct TD(0) and residual gradient TD(0) both learn the optimal policy, while TD(1) learns a suboptimal policy. These results suggest that, for this example, the differences in state values are more significant than the state values themselves, so TD(0) is preferable to TD(1).
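
The sketch below is illustrative only and is not the example from Bertsekas (1995a), which is not reproduced in this record. It implements TD(lambda) with accumulating eligibility traces on a hypothetical three-state absorbing chain (0 -> 1 -> 2 -> absorb) under linear function approximation with state aliasing; the feature matrix, step reward, and learning parameters are assumptions chosen only to show how the learned function can depend on lambda, in the spirit of the abstract.

```python
import numpy as np

# Illustrative sketch only: a hypothetical absorbing Markov chain with state
# aliasing under linear function approximation, showing that the TD(lambda)
# fixed point depends on lambda. This is NOT the Bertsekas (1995a) example.

GAMMA = 1.0                                   # undiscounted absorbing chain
PHI = np.array([[1.0, 0.0],                   # state 0 has its own feature
                [0.0, 1.0],                   # states 1 and 2 share a feature,
                [0.0, 1.0]])                  # forcing an approximate value function
N_STATES = PHI.shape[0]                       # non-terminal states 0, 1, 2
STEP_REWARD = 1.0                             # reward of 1 on every transition

def td_lambda(lam, episodes=2000, alpha=0.05):
    """Online TD(lambda) with accumulating traces on the chain 0 -> 1 -> 2 -> absorb."""
    w = np.zeros(PHI.shape[1])
    for _ in range(episodes):
        e = np.zeros_like(w)                  # eligibility trace vector
        for s in range(N_STATES):
            phi = PHI[s]
            v = phi @ w
            v_next = PHI[s + 1] @ w if s + 1 < N_STATES else 0.0
            delta = STEP_REWARD + GAMMA * v_next - v   # TD error
            e = GAMMA * lam * e + phi                  # decay and accumulate trace
            w = w + alpha * delta * e                  # TD(lambda) weight update
        # episode ends on absorption
    return PHI @ w                            # learned value of each non-terminal state

print("true values :", [3.0, 2.0, 1.0])
print("TD(0) values:", np.round(td_lambda(0.0), 2))   # approaches roughly [3, 2, 2]
print("TD(1) values:", np.round(td_lambda(1.0), 2))   # approaches roughly [3, 1.5, 1.5]
```

Under these assumed features, TD(1) tends toward the least-squares fit of the state values, while TD(0) settles on a different fixed point whose value differences happen to lie closer to the true differences, which is consistent with the distinction between the two error metrics drawn in the abstract.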

Descriptors :   *MATHEMATICAL MODELS, *MARKOV PROCESSES, ALGORITHMS, OPTIMIZATION, LEARNING MACHINES, APPROXIMATION(MATHEMATICS), ERROR ANALYSIS, LEAST SQUARES METHOD, STATISTICAL DECISION THEORY.

Subject Categories : Statistics and Probability
      Operations Research

Distribution Statement : APPROVED FOR PUBLIC RELEASE