Statement

Lemma

Given a Markov decision process, let $V_t(s)$ be the value estimate for a state $s$ at the $t$-th update. Suppose we update this estimate using the following rule:

$$V_{t+1}(s) = (1 - \alpha_t(s))\, V_t(s) + \alpha_t(s)\, v_t(s),$$

where $v_t(s)$ is a noisy sample of the true value $V(s)$ whose noise has mean 0, and $\alpha_t(s)$ is a learning rate. Then the incrementally learned $V_t(s)$ converges in the limit, provided that every state $s$ is visited infinitely often and that:

  1. The sum of the learning rates diverges: $\sum_{t} \alpha_t(s) = \infty$, and
  2. The sum of the squared learning rates converges: $\sum_{t} \alpha_t(s)^2 < \infty$.
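
To make the two conditions concrete, here is a minimal numerical sketch for a single state. It assumes Gaussian noise with unit variance and a true value of 5.0, both chosen only for illustration. The function name `incremental_estimate` and its parameters are hypothetical and do not come from the lemma. A learning rate of 1/t satisfies both conditions (the harmonic series diverges, while the sum of 1/t² converges), whereas a constant learning rate violates the second one.

```python
import random


def incremental_estimate(true_value, num_updates, learning_rate, seed=0):
    """Run the incremental update on one state and return the final estimate.

    learning_rate is a function mapping the update index t (1, 2, ...) to a
    step size. Noise is assumed Gaussian with mean 0 and standard deviation 1.
    """
    rng = random.Random(seed)  # local generator so runs are reproducible
    estimate = 0.0  # arbitrary initial value estimate
    for t in range(1, num_updates + 1):
        # Noisy sample of the true value with zero-mean noise.
        sample = true_value + rng.gauss(0.0, 1.0)
        alpha = learning_rate(t)
        # Move the estimate a fraction alpha of the way toward the sample.
        estimate = (1.0 - alpha) * estimate + alpha * sample
    return estimate


# Learning rate 1/t: satisfies both conditions of the lemma.
decaying = incremental_estimate(5.0, 100_000, lambda t: 1.0 / t)

# Constant learning rate: the squared rates do not sum to a finite value.
constant = incremental_estimate(5.0, 100_000, lambda t: 0.5)

print("error with 1/t rate:     ", abs(decaying - 5.0))
print("error with constant rate:", abs(constant - 5.0))
```

With the 1/t schedule the update reduces to a running average of the samples, so the error shrinks as more samples arrive. With the constant schedule the estimate keeps fluctuating around the true value, because each new noisy sample always receives the same weight.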

Proof