
As part of my Deep Learning research, I frequently have to train models that require a lot of computing power, so I submit my training jobs to my university's HPC environment.

However, I run into one major issue: monitoring the training performance and metrics.

I generally build my models with Keras, and normally it is convenient to check the console from time to time to see how training is progressing.

When I train models on my own system, I use a tool called CometML. However, the HPC does not allow socket connections, so I cannot use it to monitor jobs there.


Is there a way or tool that can be used to monitor the metrics in this setup? For now, I dump the logs from time to time, download them to my own system, and check them there, but this is extremely time-consuming and inefficient.
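To make the manual workflow concrete, here is a minimal sketch of the "download a log and check it" step. It assumes the metrics are written as a CSV file with a header row and one row per epoch, similar to what Keras's `CSVLogger` callback produces; the column names and the sample contents are illustrative, not taken from a real run.

```python
import csv
import io


def latest_metrics(log_text):
    """Return the last row of a CSV metrics log as a dict of floats.

    Assumes a header row followed by one row per epoch (an assumed
    layout, similar to the output of Keras's CSVLogger callback).
    """
    rows = list(csv.DictReader(io.StringIO(log_text)))
    if not rows:
        # Empty file, or only a header so far: nothing to report yet.
        return {}
    return {name: float(value) for name, value in rows[-1].items()}


# Hypothetical log contents; a real log would be read from the downloaded file.
sample_log = "epoch,loss,val_loss\n0,0.91,1.02\n1,0.64,0.80\n"
print(latest_metrics(sample_log))
```

Having to repeat this download-and-inspect step for every check is the part I would like to avoid.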

If there is a more efficient way or tool, please let me know.

Thanks.

Brian Spiering
Adhish Thite

0 Answers