The main motivation for policy gradient methods is to handle continuous action spaces, which are difficult for Q-learning because it requires maximizing Q over all actions.
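To make the difficulty concrete, the Q-learning target contains a maximization over actions,

$$ y_t \;=\; r_t + \gamma \max_{a'} Q(s_{t+1}, a'), $$

and with a continuous action space this max is itself a nontrivial optimization problem at every update step.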
SPG can handle continuous actions because its policy is a continuous probability distribution over actions. Since SPG models the policy as a distribution, computing the gradient of the expected return requires an integral over actions, and in the off-policy setting SPG resorts to importance sampling for this integration.
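Concretely, the on-policy stochastic policy gradient has the form (notation roughly follows the papers listed below)

$$ \nabla_\theta J(\pi_\theta) \;=\; \mathbb{E}_{s \sim \rho^{\pi},\, a \sim \pi_\theta}\!\big[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi}(s, a) \big], $$

where the expectation over $a \sim \pi_\theta$ hides an integral over the action space; in the off-policy case actions come from a behaviour policy $\beta$, which brings in an importance weight $\pi_\theta(a \mid s)/\beta(a \mid s)$.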
DPG, by contrast, is a deterministic mapping from states to actions. It can avoid the global maximization of Q because it never picks the action with the globally largest Q; instead it takes the action given by the deterministic map (on-policy) and shifts this map along the gradient of Q (both on-policy and off-policy). The gradient of the expected return then takes a form that needs no integral over actions and is easier to compute.
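In the notation of the DPG paper, that form is

$$ \nabla_\theta J(\mu_\theta) \;=\; \mathbb{E}_{s \sim \rho^{\mu}}\!\big[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu}(s, a)\big|_{a = \mu_\theta(s)} \big], $$

an expectation over states only, with no integral over the action space.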
You could say this looks like a step from stochastic policies back to deterministic policies. But the stochastic policy was introduced in the first place only to handle the continuous action space; the deterministic policy now provides another way to handle it.
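As a rough illustration (not taken from the papers above; a minimal PyTorch sketch with made-up state/action dimensions and network sizes), a DDPG-style actor update follows exactly this idea: push the deterministic map along the gradient of Q.

```python
# Minimal sketch of a deterministic-policy (DPG/DDPG-style) actor update.
# All dimensions and architectures here are illustrative assumptions.
import torch
import torch.nn as nn

state_dim, action_dim = 8, 2

# Deterministic actor: state -> action
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())

# Critic: Q(s, a) -> scalar
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))

actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

# A batch of states (would normally come from a replay buffer)
states = torch.randn(32, state_dim)

# Actor objective: maximize Q(s, mu(s)), i.e. minimize -Q.
# Backprop then carries grad_a Q through the actor's parameters,
# which is the deterministic policy gradient in practice.
actions = actor(states)
actor_loss = -critic(torch.cat([states, actions], dim=1)).mean()

actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()
```

No integral or sampling over actions appears anywhere: each state contributes exactly one action, the one produced by the deterministic map.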
My observations are drawn from these papers:
Deterministic Policy Gradient Algorithms
Policy Gradient Methods for Reinforcement Learning with Function Approximation
Continuous Control with Deep Reinforcement Learning