I am working on an Agent class in Python 2.7.11 that uses a Markov Decision Process (MDP) to find an optimal policy π in GridWorld. I am performing basic value iteration for 100 iterations over all GridWorld states, using the following Bellman equation:

$$V_{k+1}(s) = \max_{a} \sum_{s'} T(s, a, s')\left[R(s, a, s') + \gamma V_k(s')\right]$$

- T(s, a, s') is the probability of successfully transitioning to the successor state s' from the current state s by taking action a.
- R(s, a, s') is the reward for the transition from s to s'.
- γ (gamma) is the discount factor, where 0 ≤ γ ≤ 1.
- V_k(s') is a recursive call to repeat the calculation once s' has been reached.
- V_{k+1}(s) is representative in the sense that, after a sufficient number of iterations k, V_k converges and becomes equivalent to V_{k+1}.
This equation is obtained by taking the maximum over the Q function, which I use in my program:

$$V_{k+1}(s) = \max_{a} Q_{k+1}(s, a), \qquad Q_{k+1}(s, a) = \sum_{s'} T(s, a, s')\left[R(s, a, s') + \gamma V_k(s')\right]$$
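To make the backup concrete, here is a minimal pure-Python sketch of one such update for a single state; the transition probabilities, rewards, and state names below are invented purely for illustration and are not part of my actual GridWorld:

```python
# One Bellman backup V_{k+1}(s) = max_a Q_{k+1}(s, a) on made-up numbers.
gamma = 0.9

# T[(s, a)] -> list of (successor, probability) pairs
T = {('s', 'right'): [('goal', 0.8), ('s', 0.2)],
     ('s', 'left'):  [('s', 1.0)]}
# R[(s, a, s')] -> reward for that transition
R = {('s', 'right', 'goal'): 1.0, ('s', 'right', 's'): 0.0,
     ('s', 'left', 's'): 0.0}

V_k = {'s': 0.0, 'goal': 0.0}  # values from the previous iteration

def q_value(state, action):
    # Q_{k+1}(s, a) = sum over s' of T(s, a, s') * (R(s, a, s') + gamma * V_k(s'))
    return sum(prob * (R[(state, action, nxt)] + gamma * V_k[nxt])
               for nxt, prob in T[(state, action)])

# V_{k+1}(s) takes the maximum Q-value over the actions available in s
v_next = max(q_value('s', a) for a in ('right', 'left'))
print v_next  # 0.8 * (1.0 + 0.9 * 0.0) + 0.2 * (0.0 + 0.9 * 0.0) = 0.8
```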
When my Agent is built, it is passed an MDP, which is an abstract class containing (among others) the following methods:

```python
# Returns all states in the GridWorld
def getStates()

# Returns all legal actions the agent can take in the given state
def getPossibleActions(state)
```
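For reference, a minimal stub satisfying this interface might look like the following; the two-state layout, the rewards, and the getTransitionStatesAndProbs / getReward method names are my illustration of the interface, not the real GridWorld implementation:

```python
# Hypothetical two-state MDP stub -- only to illustrate the interface
# the Agent consumes, not the actual GridWorld class.
class SimpleMDP:
    def getStates(self):
        # Returns all states in the world
        return ['a', 'b']

    def getPossibleActions(self, state):
        # Returns all legal actions in the given state
        return ['stay', 'move']

    def getTransitionStatesAndProbs(self, state, action):
        # Returns (successor, probability) pairs; deterministic here.
        # NOTE: this method name is an assumption about the abstract class.
        if action == 'move':
            return [('b' if state == 'a' else 'a', 1.0)]
        return [(state, 1.0)]

    def getReward(self, state, action, nextState):
        # NOTE: also an assumed name; reward only for arriving in 'b'.
        return 1.0 if nextState == 'b' else 0.0
```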
My Agent is also passed the discount factor and the number of iterations. I also use a dictionary to track my values. Here is my code:
```python
class IterationAgent:

    def __init__(self, mdp, discount = 0.9, iterations = 100):
        self.mdp = mdp
        self.discount = discount
        self.iterations = iterations
        self.values = util.Counter()

        for i in range(0, self.iterations, 1):
            # Work on a copy so every state in this sweep reads only
            # the values from the previous iteration.
            valuesCopy = self.values.copy()
            states = self.mdp.getStates()
            for state in states:
                legalMoves = self.mdp.getPossibleActions(state)
                convergedValue = 0
                for move in legalMoves:
                    value = self.computeQValueFromValues(state, move)
                    if convergedValue <= value or convergedValue == 0:
                        convergedValue = value
                valuesCopy.update({state: convergedValue})
            self.values = valuesCopy
```
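For completeness, here is a sketch of computeQValueFromValues, which the loop above calls; it implements the Q equation from earlier, assuming the MDP exposes the getTransitionStatesAndProbs and getReward methods shown in the stub above:

```python
    def computeQValueFromValues(self, state, action):
        # Q_{k+1}(s, a) = sum over s' of T(s, a, s') * (R(s, a, s') + gamma * V_k(s'))
        # Assumes getTransitionStatesAndProbs and getReward exist on the MDP.
        qValue = 0
        for nextState, prob in self.mdp.getTransitionStatesAndProbs(state, action):
            reward = self.mdp.getReward(state, action, nextState)
            qValue += prob * (reward + self.discount * self.values[nextState])
        return qValue
```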
This implementation is correct, although I'm not sure why I need valuesCopy to successfully update my self.values dictionary. I tried the following to omit the copying, but it doesn't work, since it returns several incorrect values:
```python
for i in range(0, self.iterations, 1):
    states = self.mdp.getStates()
    for state in states:
        legalMoves = self.mdp.getPossibleActions(state)
        convergedValue = 0
        for move in legalMoves:
            value = self.computeQValueFromValues(state, move)
            if convergedValue <= value or convergedValue == 0:
                convergedValue = value
        self.values.update({state: convergedValue})
```
My question is: why is a copy of my self.values dictionary necessary in order to update my values correctly, when valuesCopy = self.values.copy() makes a fresh copy of the dictionary at every iteration anyway? Shouldn't updating the values in the original dictionary produce the same result?
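To illustrate the difference I am seeing, here is a minimal sketch on a made-up two-state chain (not my GridWorld): updating the dictionary in place lets later states in the same sweep read values that were already updated this iteration, while the copy guarantees every state reads only last iteration's values:

```python
# Made-up chain: from state 0 a move leads to state 1, and from state 1
# to a terminal state with reward 1.0; all other rewards are 0.
gamma = 0.9

def backup(state, values):
    # Deterministic backup: V(s) = R(s) + gamma * V(successor of s)
    reward = 1.0 if state == 1 else 0.0
    return reward + gamma * values.get(state + 1, 0.0)

# One sweep using a copy: every state reads the untouched dictionary.
values = {0: 0.0, 1: 0.0}
valuesCopy = values.copy()
for state in (1, 0):
    valuesCopy[state] = backup(state, values)
print valuesCopy  # {0: 0.0, 1: 1.0} -- state 0 saw the OLD V(1) = 0.0

# The same sweep updating in place: state 0 now reads the value that
# state 1 received earlier in this very sweep.
values = {0: 0.0, 1: 0.0}
for state in (1, 0):
    values[state] = backup(state, values)
print values  # {0: 0.9, 1: 1.0} -- state 0 saw the NEW V(1) = 1.0
```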