
I have trained an RL agent using the DQN algorithm. After 20000 episodes my rewards have converged. But when I test this agent, it always takes the same action, irrespective of the state. I find this very strange. Can someone help me with this? Is there any reason anyone can think of why the agent would behave this way?

Reward plot: [image]

When I test the agent:

import numpy as np
import matplotlib.pyplot as plt

state = env.reset()
print('State: ', state)

# Encode the state and get the Q-values for all actions
state_encod = np.reshape(state, [1, state_size])
q_values = model.predict(state_encod)
action_key = np.argmax(q_values)
print(action_key)
print(index_to_action_mapping[action_key])
print(q_values[0][0])
print(q_values[0][action_key])

# Collect the Q-value of every action for this state and plot them
q_values_plotting = []
for i in range(0, action_size):
    q_values_plotting.append(q_values[0][i])

plt.plot(np.arange(0, action_size), q_values_plotting)

Every time it gives the same Q-values plot, even though the initial state is different every time. Below is the Q-value plot.

[Q-value plot: image]
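
One quick way to confirm this is to predict Q-values for several freshly reset states and compare the outputs directly (a minimal sketch reusing the env, model and state_size from the snippet above; the choice of 5 resets is arbitrary):

import numpy as np

# Predict Q-values for a few independently reset states and compare them.
predictions = []
for _ in range(5):
    s = env.reset()
    s_encod = np.reshape(s, [1, state_size])
    predictions.append(model.predict(s_encod)[0])

predictions = np.array(predictions)
# If the network output has collapsed, every row will be (almost) identical.
print('Max difference across states: ', np.max(np.abs(predictions - predictions[0])))

If this prints a value close to zero, the network is effectively ignoring its input, which points at a training problem rather than a testing bug.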

Testing code:

from copy import deepcopy  # next_state is copied into state at the end of each step

test_rewards = []
for episode in range(1000):
    terminal_state = False
    state = env.reset()
    episode_reward = 0
    while not terminal_state:
        print('State: ', state)
        state_encod = np.reshape(state, [1, state_size])
        q_values = model.predict(state_encod)
        action_key = np.argmax(q_values)
        action = index_to_action_mapping[action_key]
        print('Action: ', action)
        next_state, reward, terminal_state = env.step(state, action)
        print('Next_state: ', next_state)
        print('Reward: ', reward)
        print('Terminal_state: ', terminal_state, '\n')
        print('----------------------------')
        episode_reward += reward
        state = deepcopy(next_state)
    print('Episode Reward: ' + str(episode_reward))
    test_rewards.append(episode_reward)

plt.plot(test_rewards)

[Test rewards plot: image]

Thanks.

chink
  • Is taking the same action in every state in any way close to optimal behaviour? Or is it worse than behaving randomly? How are you measuring "my rewards are converged" and what else are you measuring? Have you plotted episode return vs number of episodes (smoothed)? For concreteness, it may be useful to share details of the environment, your state representation, the actions and rewards. This would help in case you have made a mistake in problem analysis. Although more likely you have an implementation detail wrong, as there are lots of places in DQN agents that can go wrong in implementation. – Neil Slater Oct 04 '19 at 18:30
  • Hi, is there a way I can share my ipython notebook or code? – chink Oct 04 '19 at 18:43
  • I am plotting the total reward in an episode vs. the episode number. It converges after 10000 episodes. Please suggest any other criteria that should be checked before assuming the agent is trained enough. – chink Oct 04 '19 at 18:48
  • Yes you can put a link to the notebook into the question. However, please don't expect volunteers here to work on and debug the project based on the question as is. Add the link, *and* also summarise the important details in the question - use [edit] – Neil Slater Oct 04 '19 at 18:48
  • One related question then - when you test the agent does it get the same amount of reward as you are plotting during training? – Neil Slater Oct 04 '19 at 18:49
  • When I test the trained agent, the rewards vary each time I run an episode – chink Oct 05 '19 at 13:09
  • That's not what I meant: are the rewards the agent receives during testing *consistent* with the values it receives during training? In other words, your training routine appears to be converging on a stable expected reward total per episode, so you think training is complete. Then you test the agent and note that it is always taking the same action. If you plotted the results from those test episodes, same as you plotted them during training, would the graph show a similar level? – Neil Slater Oct 05 '19 at 13:14
  • Hi, I have plotted the test results (edited & added to the question). The test rewards are similar to the training rewards. What does this mean? Why is the agent always taking the same action? – chink Oct 05 '19 at 14:56
  • Not really possible to say. I don't see any obvious errors in your plotting code. You may need to explain more about the control problem itself. Am I correct in thinking from your Q-values plot that you have 500 possible actions? And that it is picking an action with id around 250 as the maximising action in each state? – Neil Slater Oct 05 '19 at 15:24
  • yes, your understanding is correct – chink Oct 05 '19 at 16:03
  • Any mistakes you can think of while training the agent, which is leading to this behaviour? – chink Oct 05 '19 at 16:10
  • Add randomness (an $\epsilon$-greedy strategy etc.; see the sketch after these comments) and make sure the replay buffer gets new data each episode (also be sure to wipe out bad old replays). Also, could you check that `predict` outputs different values each time? – quester Oct 15 '19 at 20:26
  • Check whether your training data is skewed, e.g. 90% of the targets are 0 or something similar – quester Oct 15 '19 at 20:42
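
For reference, the $\epsilon$-greedy selection quester suggests would look roughly like this during training (a sketch; model, state_size and action_size are assumed to be the same objects as in the question's code, and the epsilon value is arbitrary):

import random
import numpy as np

epsilon = 0.1  # exploration rate (assumed value; usually decayed over training)

def select_training_action(state):
    # With probability epsilon take a random action, otherwise act greedily.
    if random.random() < epsilon:
        return random.randrange(action_size)
    state_encod = np.reshape(state, [1, state_size])
    q_values = model.predict(state_encod)
    return np.argmax(q_values[0])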

2 Answers


This may seem obvious, but have you tried using a Boltzmann distribution for action selection instead of argmax? This is known to encourage exploration and can be done by setting the action policy to

$$p(a|s) = \frac{\exp(\beta Q(a,s))}{\sum_{a'} \exp(\beta Q(a',s))},$$

where $\beta$ is the temperature parameter and governs the exploration-exploitation trade-off. This is also known as the softmax distribution.

Put into code, this would be something like this:

beta = 1.0  # temperature parameter
p_a_s = np.exp(beta * q_values) / np.sum(np.exp(beta * q_values))
action_key = np.random.choice(a=num_act, p=p_a_s)  # num_act: number of actions (action_size in the question)

This can lead to numerical instabilities because of the exponential, but that can be handled e.g. by first subtracting the highest q value:

q_values = q_values - np.max(q_values)
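
Putting the two snippets together, a numerically stable version might look like this (a sketch; here q_values is assumed to be the 1-D array of Q-values for the current state, i.e. model.predict(state_encod)[0] in the question's notation):

import numpy as np

beta = 1.0                                     # temperature parameter
q_shifted = q_values - np.max(q_values)        # subtract the max for numerical stability
p_a_s = np.exp(beta * q_shifted) / np.sum(np.exp(beta * q_shifted))
action_key = np.random.choice(a=len(q_shifted), p=p_a_s)
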
hh32
  • The action taken by the agent may in fact be the optimal action for every state.
  • If the same state is being fed in each time, you might be getting the same result. The state might not be getting updated properly; since next_state comes from env.step, check the deepcopy call.
  • The model might not be updating its parameters or its Q-values. Check how the model updates its parameters and Q-values (see the sketch below this list).
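
For the third point, a quick check is to snapshot the network weights, run one training update, and see whether anything changed (a sketch assuming a Keras-style model like the one in the question; batch_states and batch_targets are hypothetical names standing in for one replay batch):

import numpy as np

# Copy the weights, apply a single update, and compare.
weights_before = [w.copy() for w in model.get_weights()]
model.train_on_batch(batch_states, batch_targets)   # one gradient step on a replay batch
weights_after = model.get_weights()

changed = any(np.any(before != after) for before, after in zip(weights_before, weights_after))
print('Weights changed after update: ', changed)

If this prints False, the update step is not changing the model at all.
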
mcgusty