
I am running a deep RL algorithm with a custom reward function, and I train it for at least 500 epochs.

For each epoch, I print the total reward received by the actor network. It is around $-10^5$ for the first epoch, and after 500 epochs it is still almost the same.
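
For context, this is roughly how I accumulate and print the per-epoch total (a simplified sketch, not my exact code; the environment and the random policy here stand in for my custom environment and actor network):

```python
import gym  # I'm on the classic gym API; gymnasium's reset/step signatures differ slightly

# Simplified sketch of my logging loop. "CartPole-v1" and the random action
# below are placeholders for my custom environment and my actor network.
env = gym.make("CartPole-v1")

for epoch in range(500):
    state = env.reset()
    done = False
    total_reward = 0.0
    while not done:
        action = env.action_space.sample()        # placeholder for actor(state)
        state, reward, done, info = env.step(action)
        total_reward += reward                    # custom reward accumulates here
    print(f"epoch {epoch}: total reward = {total_reward:.1f}")
```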

I expected that, after enough epochs, the magnitude of the total reward would decrease and the value would move toward zero, but that did not happen.

What can I infer from this? Is my actor learning? If it is, why does the total reward not move toward zero?

Note that I want to infer whether my actor network is learning from the total reward per epoch only, and not from its actions in the environment.
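
Concretely, the only signal I am looking at is the sequence of per-epoch totals, e.g. a moving average over them to see whether there is any trend toward zero (the random values below are just a stand-in for my logged rewards):

```python
import numpy as np

# Hypothetical trend check on the logged per-epoch totals: compare the moving
# average at the start and at the end of training.
rewards = np.random.normal(-1e5, 1e3, size=500)   # stand-in for my 500 logged totals

window = 20
moving_avg = np.convolve(rewards, np.ones(window) / window, mode="valid")
print("first window mean:", moving_avg[0])
print("last window mean: ", moving_avg[-1])
```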
