I am working on an Agent class in Python 2.7.11 that uses a Markov Decision Process (MDP) to find an optimal policy π in GridWorld. I am performing basic value iteration for 100 iterations over all GridWorld states, using the following Bellman equation:

$$V_{k+1}(s) = \max_{a} \sum_{s'} T(s, a, s')\left[R(s, a, s') + \gamma V_k(s')\right]$$

- T(s, a, s') is the probability of successfully transitioning to the successor state s' from the current state s by taking action a.
- R(s, a, s') is the reward for the transition from s to s'.
- γ (gamma) is the discount factor, where 0 ≤ γ ≤ 1.
- V_k(s') is a recursive call to repeat the calculation once s' has been reached.
- V_{k+1}(s) is representative in the sense that, after a sufficient number of iterations k, V_k converges and becomes equivalent to V_{k+1}.
This equation is obtained by taking the maximum over the Q function, which I use in my program:

$$V_{k+1}(s) = \max_{a} Q_{k+1}(s, a), \qquad Q_{k+1}(s, a) = \sum_{s'} T(s, a, s')\left[R(s, a, s') + \gamma V_k(s')\right]$$
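To make the backup concrete, here is a minimal pure-Python sketch of one such update for a single state; the transition probabilities, rewards, and state names below are invented purely for illustration and are not part of my actual GridWorld:

```python
# One Bellman backup V_{k+1}(s) = max_a Q_{k+1}(s, a) on made-up numbers.
gamma = 0.9

# T[(s, a)] -> list of (successor, probability) pairs
T = {('s', 'right'): [('goal', 0.8), ('s', 0.2)],
     ('s', 'left'):  [('s', 1.0)]}
# R[(s, a, s')] -> reward for that transition
R = {('s', 'right', 'goal'): 1.0, ('s', 'right', 's'): 0.0,
     ('s', 'left', 's'): 0.0}

V_k = {'s': 0.0, 'goal': 0.0}  # values from the previous iteration

def q_value(state, action):
    # Q_{k+1}(s, a) = sum over s' of T(s, a, s') * (R(s, a, s') + gamma * V_k(s'))
    return sum(prob * (R[(state, action, nxt)] + gamma * V_k[nxt])
               for nxt, prob in T[(state, action)])

# V_{k+1}(s) takes the maximum Q-value over the actions available in s
v_next = max(q_value('s', a) for a in ('right', 'left'))
print v_next  # 0.8 * (1.0 + 0.9 * 0.0) + 0.2 * (0.0 + 0.9 * 0.0) = 0.8
```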
When my Agent is built, it is passed an MDP, which is an abstract class containing (among others) the following methods:

```python
# Returns all states in the GridWorld
def getStates()

# Returns all legal actions the agent can take in the given state
def getPossibleActions(state)
```
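For reference, a minimal stub satisfying this interface might look like the following; the two-state layout, the rewards, and the getTransitionStatesAndProbs / getReward method names are my illustration of the interface, not the real GridWorld implementation:

```python
# Hypothetical two-state MDP stub -- only to illustrate the interface
# the Agent consumes, not the actual GridWorld class.
class SimpleMDP:
    def getStates(self):
        # Returns all states in the world
        return ['a', 'b']

    def getPossibleActions(self, state):
        # Returns all legal actions in the given state
        return ['stay', 'move']

    def getTransitionStatesAndProbs(self, state, action):
        # Returns (successor, probability) pairs; deterministic here.
        # NOTE: this method name is an assumption about the abstract class.
        if action == 'move':
            return [('b' if state == 'a' else 'a', 1.0)]
        return [(state, 1.0)]

    def getReward(self, state, action, nextState):
        # NOTE: also an assumed name; reward only for arriving in 'b'.
        return 1.0 if nextState == 'b' else 0.0
```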
My Agent is also passed the discount factor and the number of iterations. I also use a dictionary to track my values. Here is my code:
```python
class IterationAgent:

    def __init__(self, mdp, discount = 0.9, iterations = 100):
        self.mdp = mdp
        self.discount = discount
        self.iterations = iterations
        self.values = util.Counter()

        for i in range(0, self.iterations, 1):
            # Work on a copy so every state in this sweep reads only
            # the values from the previous iteration.
            valuesCopy = self.values.copy()
            states = self.mdp.getStates()
            for state in states:
                legalMoves = self.mdp.getPossibleActions(state)
                convergedValue = 0
                for move in legalMoves:
                    value = self.computeQValueFromValues(state, move)
                    if convergedValue <= value or convergedValue == 0:
                        convergedValue = value
                valuesCopy.update({state: convergedValue})
            self.values = valuesCopy
```
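For completeness, here is a sketch of computeQValueFromValues, which the loop above calls; it implements the Q equation from earlier, assuming the MDP exposes the getTransitionStatesAndProbs and getReward methods shown in the stub above:

```python
    def computeQValueFromValues(self, state, action):
        # Q_{k+1}(s, a) = sum over s' of T(s, a, s') * (R(s, a, s') + gamma * V_k(s'))
        # Assumes getTransitionStatesAndProbs and getReward exist on the MDP.
        qValue = 0
        for nextState, prob in self.mdp.getTransitionStatesAndProbs(state, action):
            reward = self.mdp.getReward(state, action, nextState)
            qValue += prob * (reward + self.discount * self.values[nextState])
        return qValue
```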
This implementation is correct, although I'm not sure why I need valuesCopy to successfully update my self.values dictionary. I tried the following to omit the copying, but it doesn't work, since it returns several incorrect values:
```python
for i in range(0, self.iterations, 1):
    states = self.mdp.getStates()
    for state in states:
        legalMoves = self.mdp.getPossibleActions(state)
        convergedValue = 0
        for move in legalMoves:
            value = self.computeQValueFromValues(state, move)
            if convergedValue <= value or convergedValue == 0:
                convergedValue = value
        self.values.update({state: convergedValue})
```
My question is: why is a copy of my self.values dictionary necessary in order to update my values correctly, when valuesCopy = self.values.copy() makes a fresh copy of the dictionary at every iteration anyway? Shouldn't updating the values in the original dictionary produce the same result?
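To illustrate the difference I am seeing, here is a minimal sketch on a made-up two-state chain (not my GridWorld): updating the dictionary in place lets later states in the same sweep read values that were already updated this iteration, while the copy guarantees every state reads only last iteration's values:

```python
# Made-up chain: from state 0 a move leads to state 1, and from state 1
# to a terminal state with reward 1.0; all other rewards are 0.
gamma = 0.9

def backup(state, values):
    # Deterministic backup: V(s) = R(s) + gamma * V(successor of s)
    reward = 1.0 if state == 1 else 0.0
    return reward + gamma * values.get(state + 1, 0.0)

# One sweep using a copy: every state reads the untouched dictionary.
values = {0: 0.0, 1: 0.0}
valuesCopy = values.copy()
for state in (1, 0):
    valuesCopy[state] = backup(state, values)
print valuesCopy  # {0: 0.0, 1: 1.0} -- state 0 saw the OLD V(1) = 0.0

# The same sweep updating in place: state 0 now reads the value that
# state 1 received earlier in this very sweep.
values = {0: 0.0, 1: 0.0}
for state in (1, 0):
    values[state] = backup(state, values)
print values  # {0: 0.9, 1: 1.0} -- state 0 saw the NEW V(1) = 1.0
```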