Neural Network Q-Training - Mountain Car

So, I have been reading about Q-learning and neural networks. I believe I have the right idea, but I would like a second opinion on my code for the NN and on how I update it with Q-values.

I created a MATLAB implementation of the Mountain Car problem and of my neural network; I use the Neural Network Toolbox for the NN part.

The network has 2 inputs, 5-20 hidden units (varied for experiments), and 3 outputs (corresponding to the actions in Mountain Car).

The hidden units use tansig, the output layer uses purelin, and the training function is traingdm.

Are these the right steps?

  • get the initial state s → [-0.5; 0.0]
  • run the network with Qs = net(s); this gives a 1x3 matrix of Q-values, one for each action in the initial state
  • select action a using e-greedy choice
  • simulate one Mountain Car step and get s' (the new state resulting from action a)
  • run the network with Qs_prime = net(s') to get another matrix of Q-values for s'

Now I'm not sure if this is correct, since I need to figure out how to properly update the weights of the NN.

  • calculate QTarget, i.e. QTarget = reward + gamma * max Q-value over s'?
  • create a Targets matrix (1x3) with the Q-values from the initial s and set the entry for the performed action a to QTarget
  • use net = train(net, s, Targets) to update the weights of the NN
  • s = s'
  • repeat all of the above for the new s

Example:

actions: 1 2 3
Qs  = [1.3346  -1.9000  0.2371]
selected action: 3 (corresponding to moving the mountain car forward)
Qs' = [1.3328  -1.8997  0.2463]
QTarget = reward + gamma*max(Qs') = -1 + 1.0*1.3328 = 0.3328
s = [-0.5; 0.0] and Targets = [1.3346  -1.9000  0.3328]

Or do I have this wrong, and Targets should be [0  0  0.3328], since we don't know how good the other actions are?

Here is my MATLAB code (I am using R2011 and the Neural Network Toolbox):

 %create a neural network
 num_hidden = 5;
 num_actions = 3;
 net = newff([-1.2 0.6; -0.07 0.07], [num_hidden, num_actions], {'tansig', 'purelin'}, 'traingdm');
 %network weight and bias initialization
 net = init(net);
 %turn off the training window
 net.trainParam.showWindow = false;
 %neural network training parameters
 net.trainParam.lr = 0.01;
 net.trainParam.mc = 0.1;
 net.trainParam.epochs = 100;

 %parameters for Q-learning
 epsilon = 0.9;
 gamma = 1.0;

 %parameters for the Mountain Car task
 maxEpisodes = 10;
 maxSteps = 5000;
 reset = false;
 inital_pos = -0.5;
 inital_vel = 0.0;

 %construct the initial state
 s = [inital_pos; inital_vel];
 Qs = zeros(1,3);
 Qs_prime = zeros(1,3);

 %training for maxEpisodes
 for i = 1:maxEpisodes
     %each episode is maxSteps long
     for j = 1:maxSteps
         %run the network and get the Q-values for the current state:
         %Qs -> vector of Q-values for state s at time t, Q(s_t)
         Qs = net(s);
         %select an action
         if (rand() <= epsilon)
             %greedy: take the action with the maximum Q-value
             [Qs_value, a] = max(Qs);
         else
             %explore: a random action between 1 and 3 (inclusive)
             a = randint(1,1,3) + 1;
         end
         %simulate one step of Mountain Car
         [s_prime, action, reward, reset] = SimulateMC(s, a);
         %get the Q-values for s_prime -> Q(s_t+1)
         Qs_prime = net(s_prime);
         %compute the target for the weight update, r + gamma*max Q(s_t+1) over all actions
         Q_target = reward + gamma*max(Qs_prime);
         %create a Targets matrix with the original state's Q-values
         Targets = Qs;
         %change the Q-value of the chosen action to the target
         Targets(a) = Q_target;
         %update the network for input state s and the targets
         [net, TR] = train(net, s, Targets);
         %update the state for the next step
         s = s_prime;
         %display exactly where the car is; the NN has learned if this output reaches -0.45
         disp(s(1))
         if reset == true
             bestSteps = j;
             break
         end
     end
     %reset for a new episode
     reset = false;
     s = [inital_pos; inital_vel];
 end

 %test the network
 %reset the state
 s = [inital_pos; inital_vel];
 for i = 1:maxEpisodes
     for j = 1:maxSteps
         %run the network and get the Q-values for the current state
         Qs = net(s);
         %always select the greedy action
         [Qs_value, a] = max(Qs);
         %simulate one step of Mountain Car
         [s_prime, action, reward, reset] = SimulateMC(s, a);
         s = s_prime;
         disp(s(1))
     end
     s = [inital_pos; inital_vel];
 end

thanks

1 answer

Representation of the problem

Using neural networks to represent the action-value function is a good idea. It has been shown to work well for a number of applications. However, a more natural representation of the Q-function would be a network that takes the combined state-action vector as input and has a scalar output. As long as the number of actions is finite and small, though, it should be possible to do it the way you do. Just remember that, strictly speaking, you are not learning Q(s, a) but several value functions V(s) (one for each action) that share all of their weights except for those of the last layer.
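For illustration only (this is not from the original post), a minimal sketch of that alternative representation, using the same obsolete newff syntax as the question: the state and the action index go in together, and a single scalar Q-value comes out. The action-input range 1..3 is an assumption.

    % Hypothetical sketch of a Q(s,a) network: inputs are position, velocity
    % and the action index (assumed to lie in 1..3); output is one Q-value.
    num_hidden = 5;
    net_sa = newff([-1.2 0.6; -0.07 0.07; 1 3], [num_hidden, 1], ...
                   {'tansig', 'purelin'}, 'traingdm');
    net_sa = init(net_sa);

    % Evaluating all actions for a state s then takes one forward pass per action:
    s = [-0.5; 0.0];
    Qs = zeros(1, 3);
    for a = 1:3
        Qs(a) = net_sa([s; a]);
    end

With only three discrete actions, the three-output network from the question avoids these extra forward passes, which is presumably why the setup in the question is considered acceptable here.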

Testing

This is the straightforward greedy exploitation of the Q-function. It should be correct.

Training

There are several pitfalls you will have to think about. The first is scaling. To train a neural network you really should scale the inputs to the same range. If you use a sigmoidal activation function in the output layer, you will also have to scale the target values.
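As a rough illustration of the scaling point (assuming the network does not already normalize its inputs internally), one could map the known Mountain Car ranges to [-1, 1] before every call to net(); since the question's output layer is purelin (linear), the targets should not strictly need rescaling.

    % Illustrative input scaling using the known Mountain Car state ranges;
    % 'net' is assumed to be the network built in the question.
    pos_range = [-1.2 0.6];
    vel_range = [-0.07 0.07];
    scale = @(x, r) 2*(x - r(1)) / (r(2) - r(1)) - 1;   % maps r(1)..r(2) to -1..1

    s = [-0.5; 0.0];
    s_scaled = [scale(s(1), pos_range); scale(s(2), vel_range)];
    Qs = net(s_scaled);   % the same scaling must be applied everywhere net() is called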

Data efficiency is another thing to think about. You can do several network updates with each transition sample. Learning will be faster, but you will have to store every transition sample in memory.
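A rough sketch of what storing transitions could look like (the buffer name and row layout here are made up for illustration):

    % Hypothetical transition memory: one row per sample,
    % [s(1) s(2) a reward s_prime(1) s_prime(2)].
    max_memory = 10000;
    memory = zeros(max_memory, 6);
    mem_count = 0;

    % ...later, inside the step loop, right after SimulateMC:
    mem_count = mem_count + 1;
    idx = mod(mem_count - 1, max_memory) + 1;   % overwrite the oldest rows once full
    memory(idx, :) = [s' a reward s_prime'];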

Online vs. batch: if you store your samples, you can do batch training and avoid the problem of the latest samples overwriting what has already been learned in other parts of the state space.
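Building on the memory sketched above, a hedged example of a batch update: train() in the Neural Network Toolbox accepts matrices with one column per sample, so several stored transitions can be folded into a single training call. The buffer, 'net' and 'gamma' from earlier are assumed, and at least one transition is assumed to have been stored.

    % Illustrative batch update over stored transitions.
    n_stored   = min(mem_count, max_memory);
    batch_size = min(32, n_stored);
    rows       = randperm(n_stored);
    batch      = memory(rows(1:batch_size), :);

    S  = batch(:, 1:2)';          % states,       2 x batch_size
    A  = batch(:, 3);             % actions taken
    R  = batch(:, 4);             % rewards
    Sp = batch(:, 5:6)';          % next states,  2 x batch_size

    T  = net(S);                  % start targets from the current predictions
    Qp = net(Sp);                 % Q-values of the next states
    for k = 1:batch_size
        T(A(k), k) = R(k) + gamma * max(Qp(:, k));
    end

    [net, TR] = train(net, S, T); % one batch update instead of a per-sample update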

Literature

You should take a look at


Source: https://habr.com/ru/post/1496828/

