Statement
Lemma
Given a Markov decision process $M$, let $V_t(s)$ be the value estimate for a state $s$ at the $t$-th update. Suppose we update this estimate using the following rule:

$$V_{t+1}(s) = (1 - \alpha_t)\, V_t(s) + \alpha_t\, \hat{V}(s),$$

where $\hat{V}(s)$ is a noisy sample of the true value $V(s)$ with noise of mean 0, and $\alpha_t$ is a learning rate. Then the incrementally learned $V_t(s)$ will converge in the limit to $V(s)$, provided that every state is visited infinitely often and:
- The sum of the learning rates diverges: $\sum_{t} \alpha_t = \infty$, and
- The sum of the squared learning rates converges: $\sum_{t} \alpha_t^2 < \infty$.
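
As an illustration, the schedule $\alpha_t = 1/t$ satisfies both conditions: the harmonic series $\sum_t 1/t$ diverges, while $\sum_t 1/t^2 = \pi^2/6$ is finite. The sketch below simulates the update for a single state under that schedule. It is a minimal example, not a proof: the function name `incremental_estimate`, its parameters, and the choice of Gaussian noise are all assumptions made for illustration and do not come from the lemma itself.

```python
import random


def incremental_estimate(true_value, num_updates, noise_std=1.0, seed=0):
    """Run the incremental update for one state with alpha_t = 1/t.

    Hypothetical helper for illustration: the noise is assumed to be
    zero-mean Gaussian, which is one instance of "noise of mean 0".
    """
    rng = random.Random(seed)
    v = 0.0  # initial estimate; it is overwritten at t = 1 because alpha_1 = 1
    for t in range(1, num_updates + 1):
        alpha = 1.0 / t  # sum of alpha_t diverges, sum of alpha_t^2 converges
        sample = true_value + rng.gauss(0.0, noise_std)  # noisy sample of the true value
        v = (1.0 - alpha) * v + alpha * sample  # the update rule from the lemma
    return v


if __name__ == "__main__":
    for n in (10, 1_000, 100_000):
        print(n, incremental_estimate(5.0, n))
```

With $\alpha_t = 1/t$ the estimate is exactly the running average of the samples seen so far, so the printed values should drift toward the true value 5.0 as the number of updates grows. A constant learning rate would violate the second condition, and the estimate would keep fluctuating with the noise instead of settling.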