TD(λ) in Delphi / Pascal (temporal difference)

I have an artificial neural network playing Tic-Tac-Toe, but it is not completed yet.


What else do I have:

  • a reward array `R[t]` with an integer value for each time step or move `t` (1 = player A wins, 0 = draw, -1 = player B wins)
  • input values are correctly propagated through the network
  • a formula for adjusting the weights:

[image: weight-update formula]


What is missing:

  • Training the ANN: I still need a procedure that propagates the network error backwards using the TD(λ) algorithm.

But I do not really understand this algorithm.


My approach so far ...

The decay parameter λ should be 0.1, since states far from the end of the game should not receive a large share of the reward.

The learning rate is 0.5 in both layers (input and hidden).

This is a case of delayed reward: the reward stays 0 until the game is over; then it becomes 1 if the first player wins, -1 if the second player wins, and 0 in the event of a tie.
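The delayed-reward setup described above can be sketched in a few lines. (The question is about Delphi/Pascal; this Python sketch is only illustrative and the helper name `make_rewards` is hypothetical.)

```python
# Hypothetical sketch of the delayed-reward signal described above:
# R[t] stays 0 for every move until the game ends, then the final
# entry carries +1 (player A wins), -1 (player B wins) or 0 (draw).

def make_rewards(num_moves, outcome):
    """outcome: +1 = player A wins, -1 = player B wins, 0 = draw."""
    R = [0] * num_moves
    R[-1] = outcome          # only the terminal move is rewarded
    return R

print(make_rewards(5, 1))    # [0, 0, 0, 0, 1]
```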


My questions:

  • How and when do you calculate the network error (TD error)?
  • How is the error backpropagated through the network?
  • How are weights adjusted using TD (λ)?

Thank you very much in advance :)

+4
3 answers

If you are serious about making this work, understanding TD(λ) would be very helpful. Sutton and Barto's book "Reinforcement Learning: An Introduction" is available for free in HTML format and describes this algorithm in detail. Essentially, TD(λ) learns a mapping from a game state to the expected reward at the end of the game. As games are played, states that are more likely to lead to winning states gradually receive higher expected-reward values.
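To make the answer's description concrete, here is a hedged Python sketch of one tabular TD(λ) update step (the logic ports directly to Delphi). The constants reuse the values from the question; the function name and the use of dicts for the value table `V` and traces `e` are my own illustrative choices, not from the post.

```python
# Illustrative only: the core TD(lambda) quantities the answer refers to.
# delta = r + gamma * V(s') - V(s) is the TD error, and every visited
# state's eligibility trace decays by gamma * lambda each step.

GAMMA, LAM, ALPHA = 1.0, 0.1, 0.5   # discount, decay, learning rate

def td_lambda_step(V, e, s, s_next, r, terminal=False):
    """One update of a tabular value function V with traces e (dicts)."""
    v_next = 0.0 if terminal else V.get(s_next, 0.0)
    delta = r + GAMMA * v_next - V.get(s, 0.0)   # TD error
    e[s] = e.get(s, 0.0) + 1.0                   # accumulating trace
    for state in list(e):
        V[state] = V.get(state, 0.0) + ALPHA * delta * e[state]
        e[state] *= GAMma * LAM if False else GAMMA * LAM  # trace decay
    return delta
```

Because of the traces, the terminal reward at the end of a game also nudges the values of earlier states, which is exactly the "delayed reward" behaviour the question asks about.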

For a simple game such as tic-tac-toe, you are better off starting with a tabular mapping (just keep the expected reward value for each possible game state in a lookup table). Then, once you have that working, you can try using an NN as the function approximator. But I would suggest trying a separate, simpler NN project first ...
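The "lookup table" suggestion above might look like this in Python (illustrative only; the board encoding and the epsilon-greedy move selection are my assumptions, not part of the answer):

```python
# Hedged sketch of the tabular representation the answer suggests:
# the value table is just a dict keyed by the board position.
import random

values = {}                       # state -> learned expected reward

def value(board):
    return values.get(tuple(board), 0.0)   # unseen states start at 0

def greedy_move(board, player, epsilon=0.1):
    """Pick the empty cell whose successor state has the highest value
    (exploring a random move with probability epsilon)."""
    empty = [i for i, c in enumerate(board) if c == ' ']
    if random.random() < epsilon:
        return random.choice(empty)
    def after(i):
        nxt = list(board)
        nxt[i] = player
        return value(nxt)
    return max(empty, key=after)
```

Tic-tac-toe has few enough reachable positions that this table stays small, which is why the answer recommends it before moving to an NN.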

+2

I was confused about this too, but I believe this is how it works:

Starting at the output node, you compare R (the result) with E (the expected output). If E = R, everything is fine and you make no changes.

If E != R, you look at how far off it was, based on the thresholds and so on, and then shift the weight or threshold slightly up or down. Then, based on the new weights, you go back a layer, judge whether it was too high or too low, and repeat with a weaker effect.

I have never implemented this algorithm, but this is basically the idea as I understand it.
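A literal reading of this heuristic could be sketched as follows (Python, purely illustrative; the function name, the flat weight list, and the decay factor are my assumptions, and this is a rough heuristic rather than proper backpropagation):

```python
# Compare expected output E with result R, nudge weights by a fraction
# of the gap, and repeat for earlier layers with a weaker effect.

def nudge_weights(weights, E, R, rate=0.5, decay=0.1):
    """Shift each weight toward reducing the gap R - E, with the
    effect weakening (by `decay`) for weights further from the output."""
    error = R - E
    effect = rate
    for i in range(len(weights) - 1, -1, -1):   # output layer first
        weights[i] += effect * error
        effect *= decay                          # weaker effect earlier
    return weights
```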

0

As far as I remember, you train with a known set of results: you compute the output for a known input and subtract the known output value from it - that is the error.

Then you use the error to correct the network. For a single NN layer adjusted with the delta rule, I know that an epsilon (learning rate) of 0.5 is too high; 0.1 is better - slower, but more stable. With backpropagation it is a little more advanced, but as far as I remember, while the mathematics describing the NN looks complicated and hard to follow, the implementation is not that difficult.
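The single-layer delta rule mentioned above, with the suggested learning rate of 0.1, can be sketched like this (Python for illustration; the names are mine, and the linear unit without a bias term is a simplification):

```python
# Sketch of the single-layer delta rule: compute the output, subtract
# it from the known target, and adjust each weight proportionally.

ETA = 0.1   # the learning rate the answer recommends

def delta_rule_update(weights, inputs, target):
    """Adjust the weights of one linear unit toward a known target."""
    output = sum(w * x for w, x in zip(weights, inputs))
    error = target - output            # known output minus actual output
    for i, x in enumerate(inputs):
        weights[i] += ETA * error * x  # delta rule: w_i += eta*err*x_i
    return error
```

Repeating this update over the training set shrinks the error step by step; backpropagation generalizes the same idea to hidden layers.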

take a look at http://www.codeproject.com/KB/recipes/BP.aspx

or google for "backpropagation c" - it is probably easier to understand in code.

0

Source: https://habr.com/ru/post/1337691/
