I have read about Q-learning and neural networks. I believe I have the right idea, but I would like a second opinion on my code for the NN and on how I update it with the Q-values.
I created a MATLAB implementation of the Mountain Car problem, and for the NN part I use the Neural Network Toolbox.
The network has 2 inputs, 5-20 hidden units (varied for experiments) and 3 outputs (corresponding to the actions in Mountain Car).
The hidden units use tansig, the output layer uses purelin, and the training function is traingdm.
Are these the right steps?
- get the initial state s → [-0.5; 0.0]
- run the network with Qs = net(s) ... this gives me a 1x3 matrix of Q-values, one per action, for the initial state
- select action a using e-greedy selection
- simulate one step of Mountain Car and get s' (the new state resulting from action a)
- run the network with Qs_prime = net(s') to get another matrix of Q-values for s'
Now I'm not sure if this is correct, since I need to figure out how to properly update the weights of the NN.
- calculate QTarget = reward + gamma * max Q-value of s'?
- create a Targets matrix (1x3) with the Q-values of the initial s, and set the Q-value of the performed action a to QTarget
- use net = train(net, s, Targets) to update the weights in the NN
- s = s'
- repeat all of the above for the new s (a condensed one-step sketch of these steps follows below)
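In condensed form, one step of what I mean looks like this (assuming net, epsilon, gamma, s and my SimulateMC function are already defined; it just restates the steps above):

% one step of the procedure described above
Qs = net(s);                                          % 1x3 Q-values for the current state s
if rand() <= epsilon
    [Qs_value, a] = max(Qs);                          % greedy action
else
    a = randint(1,1,3) + 1;                           % random action in 1..3
end
[s_prime, action, reward, reset] = SimulateMC(s, a);  % take action a in the simulator
Qs_prime = net(s_prime);                              % 1x3 Q-values for the new state s'
QTarget = reward + gamma*max(Qs_prime);               % r + gamma * max over actions of Q(s')
Targets = Qs;                                         % keep the Q-values of the other actions
Targets(a) = QTarget;                                 % only the taken action gets the new target
net = train(net, s, Targets);                         % one supervised update towards Targets
s = s_prime;                                          % move on to the new state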
Example:
actions:  1        2        3
Qs      = 1.3346  -1.9000   0.2371
selected action: 3 (move the mountain car forward)
Qs'     = 1.3328  -1.8997   0.2463
QTarget = reward + gamma*max(Qs') = -1 + 1.0*1.3328 = 0.3328
s       = [-0.5; 0.0]
Targets = 1.3346  -1.9000   0.3328

Or do I have this wrong, and should Targets be 0 0 0.3328, since we don't know how good the other actions are?
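Written out in MATLAB, the first interpretation of this example would be the following (the alternative would start from Targets = [0 0 0] instead; this only restates the arithmetic above, not a claim about which interpretation is right):

% the numbers from the example, first interpretation
Qs       = [1.3346 -1.9000 0.2371];
Qs_prime = [1.3328 -1.8997 0.2463];
a        = 3;                                 % the action that was taken
reward   = -1;  gamma = 1.0;
QTarget  = reward + gamma*max(Qs_prime);      % -1 + 1.0*1.3328 = 0.3328
Targets  = Qs;                                % [1.3346 -1.9000 0.2371]
Targets(a) = QTarget;                         % [1.3346 -1.9000 0.3328]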
Here is my MATLAB code (I am using R2011 and the Neural Network Toolbox):
%create the neural network
num_hidden = 5;
num_actions = 3;
net = newff([-1.2 0.6; -0.07 0.07], [num_hidden, num_actions], {'tansig', 'purelin'}, 'traingdm');

%network weight and bias initialization
net = init(net);

%turn off the training window
net.trainParam.showWindow = false;

%neural network training parameters
net.trainParam.lr = 0.01;
net.trainParam.mc = 0.1;
net.trainParam.epochs = 100;

%parameters for Q-learning
epsilon = 0.9;
gamma = 1.0;

%parameters for the Mountain Car task
maxEpisodes = 10;
maxSteps = 5000;
reset = false;
inital_pos = -0.5;
inital_vel = 0.0;

%construct the initial state
s = [inital_pos; inital_vel];
Qs = zeros(1,3);
Qs_prime = zeros(1,3);

%training for maxEpisodes
for i = 1:maxEpisodes
    %each episode is maxSteps long
    for j = 1:maxSteps
        %run the network to get the Q-values for the current state:
        %Qs is the vector of Q-values Q(s_t) for state s at time t
        Qs = net(s);

        %select an action
        if (rand() <= epsilon)
            %take the action with the max Q-value
            [Qs_value, a] = max(Qs);
        else
            %take a random action between 1 and 3 (inclusive)
            a = randint(1,1,3) + 1;
        end

        %simulate one step of Mountain Car
        [s_prime, action, reward, reset] = SimulateMC(s, a);

        %get the Q-values for s_prime -> Q(s_t+1)
        Qs_prime = net(s_prime);

        %compute the Q-target for the weight update: r + gamma * max over actions of Q(s_t+1)
        Q_target = reward + gamma*max(Qs_prime);

        %create a Targets matrix with the Q-values of the original state s
        Targets = Qs;
        %change the Q-value of the performed action to the Q-target
        Targets(a) = Q_target;

        %update the network for input state s and the targets
        [net, TR] = train(net, s, Targets);

        %update the state for the next step
        s = s_prime;

        %display exactly where the car is so the user can see whether the NN learns
        %(it is learning if this output reaches -0.45)
        disp(s(1))

        if reset == true
            bestSteps = j;
            break
        end
    end
    %reset for the new episode
    reset = false;
    s = [inital_pos; inital_vel];
end

%test the network
%reset the state
s = [inital_pos; inital_vel];
for i = 1:maxEpisodes
    for j = 1:maxSteps
        %run the network and get the Q-values for the current state
        Qs = net(s);
        %always select the action with the max Q-value
        [Qs_value, a] = max(Qs);
        %simulate one step of Mountain Car
        [s_prime, action, reward, reset] = SimulateMC(s, a);
        s = s_prime;
        disp(s(1))
    end
    s = [inital_pos; inital_vel];
end
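SimulateMC is my own helper and is not shown here; it follows the standard Mountain Car dynamics, roughly along these lines (a sketch only; the exact reward and goal handling in my version may differ):

function [s_prime, action, reward, reset] = SimulateMC(s, a)
    % sketch of one Mountain Car step using the standard dynamics
    pos = s(1); vel = s(2);
    action = a - 2;                                  % map actions 1,2,3 to -1,0,+1 (back, coast, forward)
    vel = vel + 0.001*action - 0.0025*cos(3*pos);    % standard velocity update
    vel = min(max(vel, -0.07), 0.07);                % clip velocity to [-0.07, 0.07]
    pos = pos + vel;                                 % position update
    if pos < -1.2                                    % left wall: stop the car
        pos = -1.2;
        vel = 0.0;
    end
    reset = (pos >= 0.5);                            % episode ends when the goal is reached
    if reset
        reward = 0;                                  % terminal step
    else
        reward = -1;                                 % -1 per step until the goal
    end
    s_prime = [pos; vel];
end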
thanks