
I am new to knowledge distillation. I have read the paper, and I understand that it works by minimising the KL divergence between the probability distributions output by the teacher and student networks (computed from the logits, i.e. the outputs before the softmax, softened with a temperature).
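For reference, this is roughly the loss I have in mind, as a minimal PyTorch sketch (the temperature `T` and the logit shapes are my own assumptions):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Hinton-style KD loss: KL divergence between the
    temperature-softened teacher and student distributions."""
    # Soften both distributions with temperature T
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    # KL(teacher || student), scaled by T^2 as in the original paper
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)
```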

Now I want to distil a model to make it lighter. The model produces intermediate outputs using an encoder and decoder; based on those outputs it determines the markers and energy function for a watershed model, and the final segmentation is produced by the watershed process.

To distil this model, I need the teacher's output distribution, which the watershed step does not provide, since its output is a hard labelling rather than a probability distribution over classes. I could turn it into a distribution using one-hot encoding and minimise the KL divergence between the teacher and student.
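To make that concrete, this is what I mean by producing a distribution with one-hot encoding (a sketch; `num_classes` and the label-map shape are assumptions). With one-hot targets the KL divergence reduces to plain cross-entropy on the watershed labels:

```python
import torch
import torch.nn.functional as F

def one_hot_kd_loss(student_logits, watershed_labels, num_classes):
    """KD against one-hot targets built from the hard watershed output."""
    # watershed_labels: integer label map of shape (N, H, W)
    targets = F.one_hot(watershed_labels, num_classes).permute(0, 3, 1, 2).float()
    log_p_student = F.log_softmax(student_logits, dim=1)
    # KL(one_hot || student) collapses to cross-entropy with the hard labels
    return -(targets * log_p_student).sum(dim=1).mean()
```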

Can someone give me any insight into whether this will work?

Otherwise, I could distil only the encoder and decoder portion and keep the watershed portion as it is (for example as sketched below), but I would prefer a single network with a few layers, without the watershed portion.
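In that alternative, the distillation target would be the teacher's pre-watershed outputs rather than the final segmentation, for example (a sketch, assuming the marker/energy maps are dense tensors):

```python
import torch.nn.functional as F

def encoder_decoder_distill_loss(student_maps, teacher_maps):
    """Match the student's marker/energy maps to the teacher's
    (the soft outputs before the watershed step)."""
    # Simple choice: mean-squared error between the pre-watershed maps;
    # a KL term could be used instead if the maps are per-pixel distributions.
    return F.mse_loss(student_maps, teacher_maps)
```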

Please let me know your views.
