Monitor Model Training Progress over HPC Clusters

Asked Jul 05 '18 at 02:57

Active Oct 20 '21 at 22:42

Viewed 57 times

As part of my deep learning research, I frequently train models that require a lot of computing power, so I submit my training jobs to my university's HPC environment.

However, I run into one major issue: monitoring the training performance and metrics.

I generally build my models with Keras, and it is convenient to check the console from time to time to see how training is progressing.

When I train models on my own system, I use a tool called CometML. However, the HPC does not allow socket connections, so I cannot use it to monitor jobs there.

Is there a way or tool that can be used to monitor the metrics? For now, I dump the logs from time to time, download them to my system, and then check them. This is extremely time-consuming and inefficient.
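To make the manual workflow above concrete, here is a minimal sketch of the "check the downloaded logs" step. It assumes the training job writes one row of metrics per epoch to a CSV file (for example through Keras's `CSVLogger` callback). The column names `epoch`, `loss` and `val_loss`, the sample data, and the helper `summarize_metrics` are all hypothetical and only illustrate the idea.

```python
import csv
import io


def summarize_metrics(csv_text, monitor="val_loss"):
    """Summarize a per-epoch metrics log given as CSV text.

    Returns None for an empty log, otherwise a dict with the number of
    logged epochs, the latest row, and the epoch with the lowest `monitor`.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return None
    latest = rows[-1]
    # Assumes lower is better for the monitored metric (e.g. a loss).
    best = min(rows, key=lambda r: float(r[monitor]))
    return {
        "epochs_logged": len(rows),
        "latest_epoch": int(latest["epoch"]),
        "latest": {k: float(v) for k, v in latest.items() if k != "epoch"},
        "best_epoch": int(best["epoch"]),
        "best_value": float(best[monitor]),
    }


# Hypothetical sample in the shape a per-epoch CSV log might have.
SAMPLE = """epoch,loss,val_loss
0,0.90,0.95
1,0.60,0.70
2,0.45,0.72
"""

summary = summarize_metrics(SAMPLE)
print(summary)
```

Running something like this on the cluster's login node would mean only a short summary needs to be read, instead of downloading the full logs each time. It does not provide live monitoring.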

If there is an efficient way/tool, please let me know.

Thanks.

edited Oct 20 '21 at 22:42

Brian Spiering

asked Jul 05 '18 at 02:57

Adhish Thite

Are you not able to use TensorBoard as well? – Danny Feb 05 '19 at 11:56

0 Answers