What is the advantage of a deterministic policy gradient over a stochastic policy gradient?

Deep Deterministic Policy Gradient (DDPG) is a state-of-the-art reinforcement learning method for continuous action spaces. Its core algorithm is the deterministic policy gradient.

However, after reading the papers and watching the talk ( http://techtalks.tv/talks/deterministic-policy-gradient-algorithms/61098/ ), I still cannot understand what the fundamental advantage of deterministic PG over stochastic PG is. The talk says it is better suited to high-dimensional action spaces and easier to train, but why is that?

+5
2 answers

The main motivation for policy gradient methods is to handle continuous action spaces, which are hard for Q-learning because of the global maximization over Q (taking the argmax over actions).
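
To make that difficulty concrete, here is a minimal illustration (in Python, with made-up names and a toy Q function) of why the greedy max over Q is trivial for a discrete action set but turns into an inner optimization problem when the action space is continuous:

    # Minimal sketch: greedy action selection under Q-learning.
    # All names and the toy Q functions are illustrative, not from the answer.
    import numpy as np

    # Discrete case: the greedy action is a simple argmax over a finite table.
    q_discrete = np.array([1.2, 0.7, 3.1])   # Q(s, a) for 3 discrete actions
    best_discrete = q_discrete.argmax()

    # Continuous case: there is no table; the max must be searched for over the
    # action space, here crudely with a grid, which scales badly with dimension.
    def q_continuous(a):                      # stand-in Q(s, a) for a scalar action
        return -(a - 0.3) ** 2

    grid = np.linspace(-1.0, 1.0, 1001)
    best_continuous = grid[np.argmax(q_continuous(grid))]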

SPG can handle continuous actions because its policy is a continuous probability distribution over actions. Since SPG treats the policy as a distribution, computing the gradient of the expected return requires an integral over actions; SPG resorts to importance sampling to estimate this integral.
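
As an illustration, here is a minimal sketch of one stochastic policy gradient step for a Gaussian policy, where the integral over actions is replaced by sampled actions and a score-function (REINFORCE-style) estimator; the names policy_net, states and returns are placeholders, not from the answer:

    # Minimal sketch of a stochastic policy gradient step for a continuous
    # Gaussian policy (illustrative names, random stand-in data).
    import torch
    import torch.nn as nn

    state_dim, action_dim = 4, 2
    policy_net = nn.Linear(state_dim, action_dim)            # outputs the Gaussian mean
    log_std = torch.zeros(action_dim, requires_grad=True)
    optimizer = torch.optim.Adam(list(policy_net.parameters()) + [log_std], lr=1e-3)

    states  = torch.randn(32, state_dim)                     # batch of visited states
    returns = torch.randn(32)                                 # sampled returns (stand-ins)

    dist    = torch.distributions.Normal(policy_net(states), log_std.exp())
    actions = dist.sample()                                   # the integral over actions
                                                              # is replaced by sampling
    log_prob = dist.log_prob(actions).sum(dim=-1)
    loss = -(log_prob * returns).mean()                       # score-function estimator of -J

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()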

DPG trains a deterministic mapping from states to actions. It can do this because it does not take the globally Q-maximizing action; instead it selects actions according to the deterministic map (if on-policy) and shifts this map along the gradient of Q (both on-policy and off-policy). The gradient of the expected return then takes the form ∇θ J(μθ) = E_s[ ∇θ μθ(s) ∇a Q(s, a)|a=μθ(s) ], which needs no integral over actions and is easier to compute.
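
For comparison, here is a minimal sketch of the corresponding deterministic actor update as used in DDPG: the actor loss is -Q(s, μ(s)), so backpropagation applies exactly the chain rule ∇θ μθ(s) ∇a Q(s, a), with no integral over actions; actor, critic and states are illustrative names:

    # Minimal sketch of the deterministic policy gradient (DDPG-style actor update):
    # the actor mu(s) is shifted along grad_a Q(s, a) evaluated at a = mu(s).
    import torch
    import torch.nn as nn

    state_dim, action_dim = 4, 2
    actor  = nn.Linear(state_dim, action_dim)                 # deterministic map s -> a
    critic = nn.Linear(state_dim + action_dim, 1)             # Q(s, a)
    actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

    states = torch.randn(32, state_dim)                       # batch, e.g. from a replay buffer

    actions  = actor(states)                                  # a = mu_theta(s)
    q_values = critic(torch.cat([states, actions], dim=-1))   # Q(s, mu_theta(s))
    actor_loss = -q_values.mean()                              # maximize Q => minimize -Q

    actor_opt.zero_grad()
    actor_loss.backward()   # backprop applies the chain rule: grad_theta mu * grad_a Q
    actor_opt.step()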

This may look like a step back from stochastic policies to deterministic policies. But the stochastic policy was introduced in the first place only to handle the continuous action space; the deterministic policy now provides another way of handling it.

These observations are drawn from the following papers:

Deterministic Policy Gradient Algorithms

Policy Gradient Methods for Reinforcement Learning with Function Approximation

Continuous Control with Deep Reinforcement Learning

+2

Because the policy is deterministic rather than stochastic, there is only one action that will be selected for each state.

0

Source: https://habr.com/ru/post/1265381/

