
Are there publications which mention numerical problems in neural network optimization?

(Blog posts, articles, workshop notes, lecture notes, books - anything?)

Background of the question

I've recently had a strange phenomenon: when I trained a convolutional network on the GTSRB dataset with a given script on my machine, it got state-of-the-art results (99.9% test accuracy), 10 times, no outlier. When I used the same scripts on another machine, I got much worse results (~80% test accuracy or so, 10 times, no outliers). I thought that I probably hadn't used the same scripts, and as it was not important for my publication I just removed all results for that dataset. I assumed I had made a mistake on one of the machines (e.g. used differently pre-processed data), but I couldn't find out where the mistake happened.

Now a friend wrote me that he has a network, a training script and a dataset which converges on machine A but does not converge on machine B. Exactly the same setup (a fully connected network trained as an autoencoder).

I have only one guess as to what might be happening: the machines have different hardware, so TensorFlow might use different algorithms for matrix multiplication / gradient calculation, and those algorithms might have different numeric properties. Those differences might allow one machine to optimize the network while the other can't.

Of course, this needs further investigation. But no matter what is happening in those two cases, I think the question is interesting. Intuitively, I would say that numeric issues should not matter much, since sharp minima are not desired anyway and a small difference in one multiplication should be dwarfed by the next update step.

Martin Thoma
  • I am assuming that you use the same versions of all (Python) packages, use seeds for shuffling the data set and use the same initial weights. Then, if your results still diverge, the differences could lie in the low-level CUDA or C libraries that are used. Actually, what you described is the reason why you should ALWAYS perform your calculations inside a virtual machine. I have seen it many times. Always use VMs and an orchestration tool for your simulations (e.g. Vagrant with Ansible). This will also make sure that you can reproduce your results years later. Edit: If you want to use GPU inside the VM, – MaxBenChrist Jul 09 '17 at 12:10
  • But VMs make the training much slower, don't they? – Martin Thoma Jul 09 '17 at 15:12
  • https://askubuntu.com/a/598335/10425 suggests that you can't use the GPU while inside a VM. This would basically mean I can't get any results. Hence a VM is not an option. – Martin Thoma Jul 09 '17 at 15:13
  • I have not used GPUs inside a VM, but I am sure that you can make it work. In industry projects I was involved in, we always used VMs. There are some caveats, e.g. VirtualBox performance drops once you hit 32 or more cores, – MaxBenChrist Jul 09 '17 at 15:14
  • 1
    anyway: it seem some work to get the pic express passthrough, see https://arrayfire.com/using-gpus-kvm-virutal-machines/ But, docker seems to be an option: https://github.com/floydhub/dl-docker – MaxBenChrist Jul 09 '17 at 15:26
  • @MartinThoma Many deep learning setups are very sensitive to parameter initialization. From what I recall (things might have changed in the latest CUDA), there is an inherent randomness in GPU computation that's beyond the control of the random seed. That might explain what you see. – horaceT Jul 09 '17 at 22:50
  • 1
    a VM does not guarantee reproducibility. Even with the same software you can easily get different results on different hardware if you have numerical instabilities. – etarion Jul 11 '17 at 18:06
  • Even with the exact same hardware and software, numerical differences can occur when distributing computations across the cores of a GPU and then summing the results in a non-deterministic order, simply because floating-point addition is NOT associative. – kbrose Nov 08 '17 at 04:33
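The floating-point non-associativity mentioned in the last comment is easy to see directly; here is a minimal Python sketch (the particular values are chosen only for illustration):

    # Floating-point addition is not associative: regrouping the same three
    # numbers (as a parallel reduction effectively does) changes the result.
    a, b, c = 1e16, -1e16, 1.0
    print((a + b) + c)  # 1.0
    print(a + (b + c))  # 0.0 -- the 1.0 is absorbed by the huge intermediate sum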

1 Answer


Are there publications which mention numerical problems in neural network optimization?

Of course, there has been a lot of research on vanishing gradients, which is entirely a numerical problem. There is also a fair amount of research on training with low-precision operations, but the result is surprising: reduced floating-point precision doesn't seem to affect neural network training. This means that precision loss is pretty unlikely to be the cause of this phenomenon.
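To make the numerical nature of vanishing gradients concrete, here is a minimal numpy sketch (the depth and the weight value are arbitrary choices for illustration): backpropagating through a chain of sigmoid units multiplies the gradient by w * sigmoid'(z), which is at most 0.25 * |w| per layer, so the signal shrinks roughly geometrically.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # A scalar chain of sigmoid units: x -> sigmoid(w*x) -> sigmoid(w*...) -> ...
    depth = 20
    w = 1.0          # illustrative weight, reused in every layer
    x = 0.5

    activations = []
    for _ in range(depth):
        x = sigmoid(w * x)
        activations.append(x)

    # Backward pass: each layer multiplies the gradient by w * sigmoid'(z),
    # and sigmoid'(z) = a * (1 - a) is at most 0.25.
    grad = 1.0
    for a in reversed(activations):
        grad *= w * a * (1.0 - a)

    print("gradient w.r.t. the input after %d layers: %e" % (depth, grad))
    # prints a value on the order of 1e-13: the signal has practically vanished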

Still, the environment can affect the computation (as suggested in the comments):

  • Most obviously, the random-number generator. Set a seed in your script and try to get a reproducible result on at least a single machine. After that you can log summaries of activations and gradients (e.g. via tf.summary in tensorflow) and compare the tensors across the machines (see the sketch after this list). Clearly, basic operations such as matrix multiplication or an element-wise exponential should give very close if not identical results, no matter what hardware is used. You should be able to see whether the tensors diverge immediately (which means there is another source of randomness) or gradually.

  • Versions of the Python interpreter, CUDA, the cuDNN driver and key libraries (numpy, tensorflow, etc.). You can go as far as matching the Linux kernel and libc versions, but I think you should expect reproducibility even without that. The cuDNN version is important, because the convolution is likely to be natively optimized. The tensorflow version is also very important, because Google rewrites the core all the time.

  • Environment variables (e.g. PATH, LD_LIBRARY_PATH), Linux configuration parameters (e.g. limits.conf), permissions.
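A minimal sketch of the seeding and summary logging in 1.x-era TensorFlow; the tiny model, the tensor names and the log directory are placeholders for your own setup:

    import random

    import numpy as np
    import tensorflow as tf

    SEED = 42

    # Record the environment alongside the results.
    print("numpy", np.__version__, "tensorflow", tf.__version__)

    # Seed every source of randomness you control.
    random.seed(SEED)
    np.random.seed(SEED)
    tf.set_random_seed(SEED)

    # Placeholder model -- substitute your own network here (43 classes as in GTSRB).
    x = tf.placeholder(tf.float32, shape=[None, 32 * 32 * 3], name="x")
    w = tf.get_variable("w", shape=[32 * 32 * 3, 43],
                        initializer=tf.random_normal_initializer(stddev=0.01, seed=SEED))
    logits = tf.matmul(x, w)
    loss = tf.reduce_mean(tf.square(logits))      # dummy loss for illustration
    grad_w = tf.gradients(loss, [w])[0]

    # Histograms of activations and gradients, to be compared across machines.
    tf.summary.histogram("logits", logits)
    tf.summary.histogram("grad_w", grad_w)
    summaries = tf.summary.merge_all()

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        writer = tf.summary.FileWriter("./logs_machine_a", sess.graph)
        batch = np.random.rand(8, 32 * 32 * 3).astype(np.float32)
        writer.add_summary(sess.run(summaries, feed_dict={x: batch}), global_step=0)
        writer.close()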

Extra precautions:

  • Explicitly specify the type of each variable; don't rely on defaults (see the sketch below).

  • Double-check that the training / test data is identical and is read in the same order (comparing checksums, as in the sketch below, is a quick way to verify this).

  • Does your computation use any pre-trained models? Any networking involved? Check that as well.
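A small sketch of the first two precautions; x_train and the file name are placeholders for your own data:

    import hashlib

    import numpy as np
    import tensorflow as tf

    # Explicit dtypes instead of relying on defaults.
    x_train = np.load("x_train.npy").astype(np.float32)   # placeholder file name
    x = tf.placeholder(tf.float32, shape=[None, x_train.shape[1]])
    w = tf.Variable(tf.zeros([x_train.shape[1], 10], dtype=tf.float32))

    # Fingerprint the data (content *and* order) and compare it across machines.
    print("x_train md5:", hashlib.md5(x_train.tobytes()).hexdigest())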

I would suspect hardware differences last: it would be an extraordinary case if a high-level computation without explicit concurrency led to different results (beyond floating-point precision differences) depending on the number of cores or the cache size.

Maxim