I'm interested in using knowledge distillation to compress a large deep learning model so that it will run on an embedded device. I've found a number of open-source examples for the knowledge distillation part, for example:
https://keras.io/examples/vision/knowledge_distillation/
The only examples I've found so far distill knowledge from a teacher to a student within a single framework. E.g. a Keras model with topology X and N1 parameters is used to train a second Keras model with the same topology X but N2 parameters, where N2 << N1.
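A minimal sketch of that setup in Keras (the layer widths and input shape here are hypothetical, not taken from the linked example):

```python
from tensorflow import keras

def build_model(width, name):
    # Same topology for teacher and student; only the layer width changes.
    return keras.Sequential(
        [
            keras.Input(shape=(28, 28, 1)),
            keras.layers.Conv2D(width, 3, padding="same", activation="relu"),
            keras.layers.MaxPooling2D(),
            keras.layers.Conv2D(width * 2, 3, padding="same", activation="relu"),
            keras.layers.GlobalAveragePooling2D(),
            keras.layers.Dense(10),  # raw logits; softmax is applied inside the loss
        ],
        name=name,
    )

teacher = build_model(width=256, name="teacher")  # N1 parameters
student = build_model(width=16, name="student")   # N2 << N1 parameters
```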
The student model needs to more or less mimic the topology of the teacher model while reducing the number of parameters. The objective combines the student's loss on the ground-truth labels with a distillation loss on the teacher's outputs, where the logits are softened by a temperature T (T=1 for the student's hard-label term).
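Concretely, I understand the objective to be roughly the usual Hinton-style combination, something like the sketch below (alpha and temperature are hyperparameters I've picked for illustration; the T**2 scaling follows the original distillation paper):

```python
import tensorflow as tf

def distillation_loss(y_true, student_logits, teacher_logits, alpha=0.1, temperature=3.0):
    # Hard term: ordinary cross-entropy on the ground-truth labels (effectively T = 1).
    hard_loss = tf.reduce_mean(
        tf.keras.losses.sparse_categorical_crossentropy(
            y_true, student_logits, from_logits=True
        )
    )
    # Soft term: KL divergence between temperature-softened teacher and student distributions.
    soft_targets = tf.nn.softmax(teacher_logits / temperature)
    soft_preds = tf.nn.softmax(student_logits / temperature)
    soft_loss = tf.keras.losses.KLDivergence()(soft_targets, soft_preds) * temperature**2
    return alpha * hard_loss + (1.0 - alpha) * soft_loss
```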
So the question is: could I train a student model from a teacher model when the deep learning frameworks are different? Let's say the teacher is a PyTorch model and the student is a TensorFlow model. I don't really need the teacher's framework for much other than the forward pass, and I'm only training the weights of the student network. It seems like it should be straightforward, but I wonder whether the internal data types or other nuances I'm unaware of would cause problems.
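What I had in mind is roughly the sketch below, reusing the Keras `student` and the `distillation_loss` from above and assuming `torch_teacher` is the already-trained PyTorch teacher (a hypothetical name). The PyTorch side only does inference, the TensorFlow side owns the training loop, and NumPy arrays are the bridge between them:

```python
import torch
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam()

def train_step(x_batch, y_batch):
    # 1. Teacher forward pass in PyTorch with gradients disabled; the teacher is never updated.
    torch_teacher.eval()
    with torch.no_grad():
        # The teacher may expect a different layout/dtype (e.g. NCHW float32), so the
        # batch might need transposing/casting here -- one of the nuances I'm asking about.
        teacher_logits = torch_teacher(torch.from_numpy(x_batch).float()).cpu().numpy()

    # 2. Student forward/backward pass in TensorFlow; only the student's weights are trained.
    with tf.GradientTape() as tape:
        student_logits = student(x_batch, training=True)
        loss = distillation_loss(y_batch, student_logits, teacher_logits)
    grads = tape.gradient(loss, student.trainable_variables)
    optimizer.apply_gradients(zip(grads, student.trainable_variables))
    return loss
```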