I get about the same utilization rate when I train models using TensorFlow. The reason is pretty clear in my case: I'm manually choosing a random batch of samples and calling the optimizer on each batch separately.
That means each batch of data starts out in main memory, gets copied into GPU memory (where the rest of the model lives), the forward/back propagation and weight update run on the GPU, and then execution is handed back to my code, where I grab another batch and call optimize on it. The GPU sits idle while each batch is selected and copied, which is what drags utilization down.
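For concreteness, here's a minimal graph-mode sketch of the kind of loop I'm describing; the toy data, model, and names (x_train, inputs, train_op, etc.) are hypothetical placeholders, not your actual code:

```python
import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

# Toy stand-ins for a real dataset and model; every name here is hypothetical.
x_train = np.random.rand(10_000, 784).astype(np.float32)
y_train = np.random.randint(0, 10, size=10_000)

inputs = tf.placeholder(tf.float32, [None, 784])
labels = tf.placeholder(tf.int64, [None])
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
logits = tf.matmul(inputs, W) + b
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

batch_size = 64
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(1000):
        # Pick a random batch on the CPU...
        idx = np.random.choice(len(x_train), batch_size, replace=False)
        # ...then block while it's copied to the GPU and one update runs.
        # The GPU idles during the batch selection and the host-to-device copy.
        sess.run(train_op, feed_dict={inputs: x_train[idx], labels: y_train[idx]})
```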
There's a faster way to do this if you spend a few hours setting up TensorFlow to load batches in parallel from pre-prepared TFRecord files.
I realize you may or may not be using TensorFlow under Keras, but since my experience tends to produce very similar utilization numbers, I'm going out on a limb and suggesting that the same cause is reasonably likely here. If your framework is loading each batch from main memory into the GPU without the added efficiency/complexity of asynchronous loading (which the GPU itself can handle, overlapping the copy with computation), then this would be an expected result.
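If you're on a recent TensorFlow, the tf.data API is one way to get that parallel, asynchronous loading; here's a rough sketch, where the filenames, feature layout, and batch size are made-up placeholders you'd swap for your own:

```python
import tensorflow as tf

# Hypothetical pre-prepared TFRecord shards.
FILENAMES = ["train-0.tfrecord", "train-1.tfrecord"]
BATCH_SIZE = 64

def parse_example(serialized):
    # Assumed record layout: a flat float feature and an integer label.
    features = tf.io.parse_single_example(
        serialized,
        {
            "x": tf.io.FixedLenFeature([784], tf.float32),
            "y": tf.io.FixedLenFeature([], tf.int64),
        },
    )
    return features["x"], features["y"]

dataset = (
    tf.data.TFRecordDataset(FILENAMES, num_parallel_reads=tf.data.AUTOTUNE)
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)  # parse on CPU threads in parallel
    .shuffle(10_000)
    .batch(BATCH_SIZE)
    .prefetch(tf.data.AUTOTUNE)  # prepare the next batch while the GPU trains on the current one
)

# With Keras, the dataset can be passed straight to fit(), so the batch copies
# overlap with computation instead of blocking it:
# model.fit(dataset, epochs=10)
```

The key piece is prefetch: the input pipeline prepares the next batch on the CPU while the GPU is still busy with the current one, so the GPU isn't left waiting between steps.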