
I'm trying to replicate the DQN Atari experiment. My DQN isn't performing well, and while checking other people's code I ran into something about experience replay that I don't understand. When you define your CNN, in the first layer you have to specify the input size (I'm using Keras + TensorFlow, so in my case it's something like (105, 80, 4), which corresponds to height, width and the number of images I feed my CNN). In the code I looked at, when they sample the minibatch from the replay memory, they feed the CNN without "packing" it into groups of 4. How is that possible? I mean, for example, if you get 32 randomly sampled experiences, don't you need to make batches of 4 before feeding them?

Here is an example of what I'm talking about: https://github.com/yilundu/DQN-DDQN-on-Space-Invaders/blob/master/replay_buffer.py https://github.com/yilundu/DQN-DDQN-on-Space-Invaders/blob/master/deep_Q.py

In that code, this is how the experiences are stored:

def add(self, s, a, r, d, s2):
    """Add an experience to the buffer"""
    # s represents the current state, a is the action,
    # r is the reward, d is whether it is the end,
    # and s2 is the next state
    experience = (s, a, r, d, s2)
    if self.count < self.buffer_size:
        self.buffer.append(experience)
        self.count += 1
    else:
        self.buffer.popleft()
        self.buffer.append(experience)
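
For context, this add method relies on a buffer_size, a count and a queue that are set up elsewhere; a minimal constructor sketch consistent with those attribute names (my own reconstruction, not copied from the linked file) would be:

from collections import deque

class ReplayBuffer:
    def __init__(self, buffer_size):
        # Maximum number of experiences kept; the oldest are dropped once full
        self.buffer_size = buffer_size
        self.count = 0
        # deque gives O(1) popleft, which add() uses when the buffer is full
        self.buffer = deque()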

Then when you need to use them:

def sample(self, batch_size):
    """Samples a number of elements equal to batch_size from the buffer
    if the buffer contains enough elements; otherwise returns all elements."""

    if self.count < batch_size:
        batch = random.sample(self.buffer, self.count)
    else:
        batch = random.sample(self.buffer, batch_size)

    # Unzips the batch into arrays of states, actions, rewards,
    # done flags and next states
    s_batch, a_batch, r_batch, d_batch, s2_batch = list(map(np.array, list(zip(*batch))))

    return s_batch, a_batch, r_batch, d_batch, s2_batch

OK, so now you have a batch of 32 states, actions, rewards, done flags and next states.
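
To make the shapes concrete, here is a self-contained sketch of what sample returns, using random placeholder arrays instead of real frames and assuming 84x84 frames with NUM_FRAMES = 4 as in the training code below:

import numpy as np

# Stand-in experiences: (state, action, reward, done, next_state), with
# zero arrays in place of real preprocessed frame stacks.
batch = [(np.zeros((84, 84, 4)), 1, 0.0, False, np.zeros((84, 84, 4)))
         for _ in range(32)]

s_batch, a_batch, r_batch, d_batch, s2_batch = map(np.array, zip(*batch))

print(s_batch.shape)   # (32, 84, 84, 4) -- 32 states, each already 4 stacked frames
print(a_batch.shape)   # (32,)           -- one action index per experience
print(r_batch.shape)   # (32,)           -- one reward per experience
print(s2_batch.shape)  # (32, 84, 84, 4) -- the corresponding next states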

This is how you feed the state batch (s_batch) and next state batch (s2_batch) to the CNN:

def train(self, s_batch, a_batch, r_batch, d_batch, s2_batch, observation_num):
    """Trains the network to fit the given parameters"""
    batch_size = s_batch.shape[0]
    targets = np.zeros((batch_size, NUM_ACTIONS))

    for i in range(batch_size):
        targets[i] = self.model.predict(s_batch[i].reshape(1, 84, 84, NUM_FRAMES), batch_size=1)
        fut_action = self.target_model.predict(s2_batch[i].reshape(1, 84, 84, NUM_FRAMES), batch_size=1)
        targets[i, a_batch[i]] = r_batch[i]
        if d_batch[i] == False:
            targets[i, a_batch[i]] += DECAY_RATE * np.max(fut_action)

    loss = self.model.train_on_batch(s_batch, targets)

    # Print the loss every 10 iterations
    if observation_num % 10 == 0:
        print("We had a loss equal to ", loss)

In my code (https://bitbucket.org/jocapal/dqn_public/src/master/Deimos_v2_13.py) I get a batch of 32 experiences, then make small batches of 4 experiences and feed them to the CNN. My question is: am I doing it wrong? And if so, how can I feed 32 experiences when my CNN is expecting 4?

Another example of what I'm saying: https://yanpanlau.github.io/2016/07/10/FlappyBird-Keras.html

Joaquin

1 Answer


The input is a 4D tensor [batch_size, height, width, channels]. A single state is already 4 frames stacked together, so when you sample a state from the experience replay you sample a 3D tensor [height, width, channels]. When you sample 32 states, you actually sample 32 of those 3D tensors and feed them directly to the network. For more details on preprocessing, refer to page 6 of the original DQN paper.
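
To illustrate with shapes only (random arrays standing in for preprocessed frames):

import numpy as np

# Four preprocessed 84x84 frames stacked on the channel axis form ONE state,
# which is already a 3D tensor.
frames = [np.random.rand(84, 84).astype(np.float32) for _ in range(4)]
state = np.stack(frames, axis=-1)
print(state.shape)    # (84, 84, 4)

# A minibatch of 32 such states is just one more array dimension: the 4D
# tensor [batch_size, height, width, channels] the network consumes directly,
# with no extra "packing" into groups of 4.
s_batch = np.array([state] * 32)
print(s_batch.shape)  # (32, 84, 84, 4)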

Brale
  • But what about breaking the correlations between samples? If I did it that way, there would be a strong correlation between images (as stated here: https://datascience.stackexchange.com/questions/24921/why-random-sample-from-replay-for-dqn?rq=1 and here, on pages 4-5: https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf). – Joaquin Apr 11 '19 at 16:48
  • Joaquin, you misunderstood: 4 frames stacked together are a **single** state. You need this because the environment is partially observable. For example, if you play Pong and you only have a single frame, you won't know whether the ball is going right or left. You need a few frames stacked together to get that information; those few frames together are a single state. The correlation part applies between different states, which are different frames stacked together. You would get a "strong correlation" only if you kept feeding consecutive state after state as the input; 4 frames stacked together won't cause such correlation. – Brale Apr 11 '19 at 17:13
  • Thanks Brale_ for your clarification. I'll try it that way and see if it works. Thanks for your time! – Joaquin Apr 11 '19 at 17:21
  • Just to clarify: when talking about the minibatches you pick randomly from the replay memory, you're referring to packs of 4 frames (full states), not to single frames, right? – Joaquin Apr 12 '19 at 01:32
  • Yes, you pick 32 packs of 4 frames; of course those 4 frames have to be in the order they happened so that you have the full information, the thing I talked about with Pong. – Brale Apr 12 '19 at 07:36
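
A fixed-length deque is one common way to keep exactly the last 4 frames in the order they happened; this is an illustrative sketch, not taken from the linked repository:

from collections import deque
import numpy as np

# Keep only the 4 most recent preprocessed frames, in the order they occurred.
frame_stack = deque(maxlen=4)

def push_frame(frame):
    """Append the newest frame; the oldest is dropped automatically."""
    frame_stack.append(frame)

def current_state():
    """Stack the 4 most recent frames into a single (84, 84, 4) state."""
    return np.stack(frame_stack, axis=-1)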