
I'm interested in using knowledge distillation to compress a large deep learning model to a smaller size so it will run on an embedded device. I've found a number of open source examples for the knowledge distillation part.

https://keras.io/examples/vision/knowledge_distillation/

https://towardsdatascience.com/model-distillation-and-compression-for-recommender-systems-in-pytorch-5d81c0f2c0ec

The only examples I've found so far distill knowledge from a teacher to a student within a single framework, e.g. a Keras model with topology X and N1 parameters is used to train a second Keras model with topology X and N2 parameters, where N2 << N1.

The student model needs to more or less mimic the topology of the parent model while reducing the number of parameters. The objective function is a weighted combination of the student's standard loss on the true labels and a distillation loss on outputs softened by a temperature T (with T=1 for the student's own loss term).
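
To make that concrete, here is a rough sketch of the loss I have in mind, written in TensorFlow and following the weighting used in the Keras example linked above. The alpha and temperature values are just placeholders, not anything tuned:

```python
import tensorflow as tf

def distillation_loss(labels, student_logits, teacher_logits,
                      alpha=0.1, temperature=3.0):
    # Hard-label term: ordinary cross-entropy on the ground-truth labels (T = 1)
    hard_loss = tf.keras.losses.sparse_categorical_crossentropy(
        labels, student_logits, from_logits=True)
    # Soft-label term: KL divergence between the temperature-softened
    # teacher and student distributions, scaled by T^2 as in Hinton et al.
    soft_loss = tf.keras.losses.kl_divergence(
        tf.nn.softmax(teacher_logits / temperature),
        tf.nn.softmax(student_logits / temperature)) * temperature ** 2
    # Weighted combination of the two terms
    return tf.reduce_mean(alpha * hard_loss + (1.0 - alpha) * soft_loss)
```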

So the question is, could I train a student model from a teacher model where the deep learning frameworks are different? Let's say the teacher model is PyTorch and the student is a TensorFlow model. I don't really need the teacher's DL framework for much other than the forward pass, and I'm only training the weights of the student network. It seems like it should be straightforward, but I wonder if the internal data types or other nuances I'm unaware of would cause problems.
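
For reference, this is roughly the training loop I'm picturing, reusing the `distillation_loss` sketch above: the PyTorch teacher only does inference, all the trainable weights live in TensorFlow, and numpy is the bridge between the two. `teacher` and `student` are placeholder names for the two models; I'm assuming the numpy boundary is where dtype/layout issues would show up:

```python
import numpy as np
import torch
import tensorflow as tf

# Placeholders: `teacher` is a trained torch.nn.Module,
# `student` is a tf.keras.Model with the reduced parameter count.
teacher.eval()
optimizer = tf.keras.optimizers.Adam()

def train_step(x_batch, y_batch):
    # x_batch, y_batch are numpy arrays. Cast to float32 so both frameworks
    # see the same dtype; image data may also need an NHWC <-> NCHW transpose
    # depending on what layout each model expects.
    x_batch = x_batch.astype(np.float32)

    # Teacher forward pass in PyTorch, no gradients needed
    # (move tensors to the teacher's device first if it lives on a GPU).
    with torch.no_grad():
        teacher_logits = teacher(torch.from_numpy(x_batch)).cpu().numpy()

    # Student forward/backward pass entirely in TensorFlow
    with tf.GradientTape() as tape:
        student_logits = student(x_batch, training=True)
        loss = distillation_loss(y_batch, student_logits, teacher_logits)
    grads = tape.gradient(loss, student.trainable_variables)
    optimizer.apply_gradients(zip(grads, student.trainable_variables))
    return loss
```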

Sledge
