Using timm's implementation of Swin Transformer, how does one generate an embedding vector?
I would like to use timm's SwinTransformer class to generate an embedding vector for use with metric learning (sub-center ArcFace).
What I've tried:
To create the SwinTransformer, I have something like:
from timm import create_model
backbone_name = 'swin_large_patch4_window7_224'
EMBEDDING_SIZE = 128
NUM_CLASSES = EMBEDDING_SIZE # ???
backbone = create_model(backbone_name, pretrained=True, num_classes=NUM_CLASSES)
This results in backbone having the structure:
SwinTransformer(
  (patch_embed): PatchEmbed(
    (proj): Conv2d(3, 192, kernel_size=(4, 4), stride=(4, 4))
    (norm): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
  )
  (pos_drop): Dropout(p=0.0, inplace=False)
  (layers): Sequential(
    ... *SNIP SNIP SNIP* ...
  )
  (norm): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
  (avgpool): AdaptiveAvgPool1d(output_size=1)
  (head): Linear(in_features=1536, out_features=128, bias=True)
)
So I did what I normally do: freeze the backbone and replace timm's head with a 3-layer multilayer perceptron.
The "smoke test" of trying to train this went really poorly, but I haven't yet explored training parameters, MLP layer sizes, etc.
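Concretely, the freeze-and-replace step looked roughly like the sketch below. The hidden-layer sizes are illustrative placeholders, and I'm assuming a timm version where head is a plain nn.Linear, as in the printout above:

import torch.nn as nn

# Freeze every backbone parameter first; only the new MLP head trains.
for param in backbone.parameters():
    param.requires_grad = False

# Swap timm's 1536 -> 128 linear head for a 3-layer MLP projector.
# The (512, 256) hidden sizes are arbitrary placeholders.
backbone.head = nn.Sequential(
    nn.Linear(backbone.num_features, 512),  # num_features == 1536 here
    nn.ReLU(inplace=True),
    nn.Linear(512, 256),
    nn.ReLU(inplace=True),
    nn.Linear(256, EMBEDDING_SIZE),
)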
Looking at the SwinTransformer architecture, my intuition is that the weights going into head are part of the "decoder" phase, whereas I presumably want the output of the "encoder" phase. (If decoder == Features -> ImageNetCategory, I suppose.)
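If that intuition is right, my guess is that the way to grab those pre-head activations is to ask create_model for zero classes, which (as I understand timm's API; it has shifted between versions) swaps the classifier for an Identity, so the forward pass returns the pooled 1536-dim features:

import torch
from timm import create_model

# num_classes=0 should make timm replace the classification head with
# an Identity, so model(x) returns the pooled features (1536-dim here).
encoder = create_model('swin_large_patch4_window7_224',
                       pretrained=True, num_classes=0)
encoder.eval()

with torch.no_grad():
    dummy = torch.randn(1, 3, 224, 224)  # dummy image batch
    features = encoder(dummy)
print(features.shape)  # expecting torch.Size([1, 1536])

(An already-built model can apparently be stripped the same way with backbone.reset_classifier(0).)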
Is there a layer in the SwinTransformer where I can grab the activations and put them into a simple MLP embedding projector?

Should I just use the SwinTransformer as-is, no layers frozen, and train towards the output of head being an embedding? Won't that be very inefficient?
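For concreteness, that second option would look something like the sketch below. I'm assuming the pytorch-metric-learning library for the sub-center ArcFace loss, and NUM_IDENTITIES is a placeholder for my number of training classes:

import torch
from timm import create_model
from pytorch_metric_learning import losses

EMBEDDING_SIZE = 128
NUM_IDENTITIES = 1000  # placeholder: class count of my training set

# Nothing frozen: the whole backbone trains, and the 1536 -> 128 head
# itself becomes the embedding projection.
model = create_model('swin_large_patch4_window7_224',
                     pretrained=True, num_classes=EMBEDDING_SIZE)

# Sub-center ArcFace keeps its own class-center weights, so its
# parameters must be optimized alongside the model's.
loss_fn = losses.SubCenterArcFaceLoss(num_classes=NUM_IDENTITIES,
                                      embedding_size=EMBEDDING_SIZE)
optimizer = torch.optim.AdamW(
    list(model.parameters()) + list(loss_fn.parameters()), lr=1e-5)

images = torch.randn(8, 3, 224, 224)    # dummy batch
labels = torch.randint(0, NUM_IDENTITIES, (8,))
embeddings = model(images)              # shape (8, 128)
loss = loss_fn(embeddings, labels)
loss.backward()
optimizer.step()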