Using timm's implementation of Swin Transformer, how does one generate an embedding vector?
I would like to use timm's SwinTransformer class to generate an embedding vector for use with metric learning (sub-center ArcFace).
What I've tried:
To create the SwinTransformer, I have something like:
from timm import create_model
backbone_name = 'swin_large_patch4_window7_224'
EMBEDDING_SIZE = 128
NUM_CLASSES = EMBEDDING_SIZE # ???
backbone = create_model(backbone_name, pretrained=True, num_classes=NUM_CLASSES)
This results in backbone having the structure:
SwinTransformer(
  (patch_embed): PatchEmbed(
    (proj): Conv2d(3, 192, kernel_size=(4, 4), stride=(4, 4))
    (norm): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
  )
  (pos_drop): Dropout(p=0.0, inplace=False)
  (layers): Sequential(
    ... *SNIP SNIP SNIP* ...
  )
  (norm): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
  (avgpool): AdaptiveAvgPool1d(output_size=1)
  (head): Linear(in_features=1536, out_features=128, bias=True)
)
So I did what I normally do: freeze the backbone and replace timm's head with a 3-layer multilayer perceptron.
The "smoke test" of trying to train this went really poorly, but I haven't yet explored training parameters, MLP layer sizes, etc.
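Concretely, the freeze-and-replace step looked roughly like the sketch below. The hidden-layer sizes are illustrative placeholders, and I'm assuming a timm version where head is a plain nn.Linear, as in the printout above:

import torch.nn as nn

# Freeze every backbone parameter first; only the new MLP head trains.
for param in backbone.parameters():
    param.requires_grad = False

# Swap timm's 1536 -> 128 linear head for a 3-layer MLP projector.
# The (512, 256) hidden sizes are arbitrary placeholders.
backbone.head = nn.Sequential(
    nn.Linear(backbone.num_features, 512),  # num_features == 1536 here
    nn.ReLU(inplace=True),
    nn.Linear(512, 256),
    nn.ReLU(inplace=True),
    nn.Linear(256, EMBEDDING_SIZE),
)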
Looking at the SwinTransformer architecture, my intuition is that the weights going into head are part of the "decoder" phase, whereas I presumably want the output of the "encoder" phase. (If decoder == Features -> ImageNetCategory, I suppose.)
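If that intuition is right, my guess is that the way to grab those pre-head activations is to ask create_model for zero classes, which (as I understand timm's API; it has shifted between versions) swaps the classifier for an Identity, so the forward pass returns the pooled 1536-dim features:

import torch
from timm import create_model

# num_classes=0 should make timm replace the classification head with
# an Identity, so model(x) returns the pooled features (1536-dim here).
encoder = create_model('swin_large_patch4_window7_224',
                       pretrained=True, num_classes=0)
encoder.eval()

with torch.no_grad():
    dummy = torch.randn(1, 3, 224, 224)  # dummy image batch
    features = encoder(dummy)
print(features.shape)  # expecting torch.Size([1, 1536])

(An already-built model can apparently be stripped the same way with backbone.reset_classifier(0).)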
Is there a layer in the SwinTransformer where I can grab the activations and put them into a simple MLP embedding projector?

Should I just use the SwinTransformer as-is, no layers frozen, and train towards the output of head being an embedding? Won't that be very inefficient?
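For concreteness, that second option would look something like the sketch below. I'm assuming the pytorch-metric-learning library for the sub-center ArcFace loss, and NUM_IDENTITIES is a placeholder for my number of training classes:

import torch
from timm import create_model
from pytorch_metric_learning import losses

EMBEDDING_SIZE = 128
NUM_IDENTITIES = 1000  # placeholder: class count of my training set

# Nothing frozen: the whole backbone trains, and the 1536 -> 128 head
# itself becomes the embedding projection.
model = create_model('swin_large_patch4_window7_224',
                     pretrained=True, num_classes=EMBEDDING_SIZE)

# Sub-center ArcFace keeps its own class-center weights, so its
# parameters must be optimized alongside the model's.
loss_fn = losses.SubCenterArcFaceLoss(num_classes=NUM_IDENTITIES,
                                      embedding_size=EMBEDDING_SIZE)
optimizer = torch.optim.AdamW(
    list(model.parameters()) + list(loss_fn.parameters()), lr=1e-5)

images = torch.randn(8, 3, 224, 224)    # dummy batch
labels = torch.randint(0, NUM_IDENTITIES, (8,))
embeddings = model(images)              # shape (8, 128)
loss = loss_fn(embeddings, labels)
loss.backward()
optimizer.step()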