
In the MuZero paper pseudocode, they have the following line of code:

hidden_state = tf.scale_gradient(hidden_state, 0.5)

What does this do? Why is it there?

I've searched for tf.scale_gradient and it doesn't exist in TensorFlow. And, unlike scalar_loss, they don't seem to have defined it in their own code.

For context, here's the entire function:

def update_weights(optimizer: tf.train.Optimizer, network: Network, batch,
                   weight_decay: float):
  loss = 0
  for image, actions, targets in batch:
    # Initial step, from the real observation.
    value, reward, policy_logits, hidden_state = network.initial_inference(
        image)
    predictions = [(1.0, value, reward, policy_logits)]

    # Recurrent steps, from action and previous hidden state.
    for action in actions:
      value, reward, policy_logits, hidden_state = network.recurrent_inference(
          hidden_state, action)
      predictions.append((1.0 / len(actions), value, reward, policy_logits))

      # THIS LINE HERE
      hidden_state = tf.scale_gradient(hidden_state, 0.5)

    for prediction, target in zip(predictions, targets):
      gradient_scale, value, reward, policy_logits = prediction
      target_value, target_reward, target_policy = target

      l = (
          scalar_loss(value, target_value) +
          scalar_loss(reward, target_reward) +
          tf.nn.softmax_cross_entropy_with_logits(
              logits=policy_logits, labels=target_policy))

      # AND AGAIN HERE
      loss += tf.scale_gradient(l, gradient_scale)

  for weights in network.get_weights():
    loss += weight_decay * tf.nn.l2_loss(weights)

  optimizer.minimize(loss)

What does scaling the gradient do, and why are they doing it there?

Pro Q
  • I have searched for `tf.scale_gradient()` on the TensorFlow website. [As the results show](https://www.tensorflow.org/s/results?q=scale_gradient), nothing comes up. It must be a function from an older TF version that has since been abandoned; it's certainly no longer available in TF 2.0. – Leevo Jan 02 '20 at 08:37
  • I don't believe it's ever been a function in TensorFlow, given the lack of results from a Google search for it. – Pro Q Jan 06 '20 at 16:55

2 Answers


Author of the paper here - I missed that this is apparently not a TensorFlow function; it's equivalent to Sonnet's scale_gradient, or to the following function:

import tensorflow as tf

def scale_gradient(tensor, scale):
  """Identity in the forward pass; scales the gradient by `scale` in the backward pass."""
  return tensor * scale + tf.stop_gradient(tensor) * (1 - scale)
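
In other words, the forward value is unchanged while the gradient flowing back through the tensor is multiplied by the scale factor. A minimal check, assuming TF 2.x eager execution and the scale_gradient defined above:

x = tf.constant(3.0)
with tf.GradientTape() as tape:
  tape.watch(x)
  y = scale_gradient(x, 0.5)

print(y.numpy())                    # 3.0  -> forward pass is the identity
print(tape.gradient(y, x).numpy())  # 0.5  -> gradient is halved rather than 1.0
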
Mononofu
  • Thank you very much for the reply! If you would be willing to look at https://stackoverflow.com/q/60234530 (another MuZero question), I would greatly appreciate it. – Pro Q Feb 14 '20 at 23:14

Given that it's pseudocode (since it's not in TF 2.0), I would go with gradient clipping or batch normalisation ('scaling of activation functions').
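
For reference, gradient norm clipping in TF 2.x is usually expressed as a single optimizer-level setting; a minimal sketch, assuming the Keras optimizer API:

import tensorflow as tf

# Gradient norm clipping: one optimizer-wide setting, applied uniformly to
# every gradient (unlike the per-tensor, per-loss-term scaling used in the
# MuZero pseudocode above).
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)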

Noah Weber
  • From the link you provided, it looks like this would likely be gradient norm scaling, which translates to setting a `clipnorm` parameter in the optimizer. However, they use gradient scaling twice in the code, with a different value each time. The `clipnorm` parameter would not allow me to do this. Do you know how I could? – Pro Q Jan 06 '20 at 16:47
  • Also, the hidden state of a model doesn't seem like something that should be clipped. (I don't understand why clipping it would be helpful at all.) Explaining what gradient clipping would be doing there would be extremely helpful for me to be certain that your answer is correct. – Pro Q Jan 06 '20 at 16:56