I am following the Keras example "Timeseries classification with a Transformer model" to classify time series with a Transformer. The model is created in the following code snippet:
from tensorflow import keras
from tensorflow.keras import layers


def transformer_encoder(inputs):
    # Normalization and Attention
    x = layers.LayerNormalization(epsilon=1e-6)(inputs)
    x = layers.MultiHeadAttention(key_dim=256, num_heads=4, dropout=0.25)(x, x)
    x = layers.Dropout(0.25)(x)
    res = x + inputs

    # Feed Forward Part
    x = layers.LayerNormalization(epsilon=1e-6)(res)
    x = layers.Conv1D(filters=4, kernel_size=1, activation="relu")(x)
    x = layers.Dropout(0.25)(x)
    x = layers.Conv1D(filters=inputs.shape[-1], kernel_size=1)(x)
    return x + res
def build_model(input_shape, num_transformer_blocks):
    inputs = keras.Input(shape=input_shape)
    x = inputs
    for _ in range(num_transformer_blocks):
        x = transformer_encoder(x)
    x = layers.GlobalAveragePooling1D(data_format="channels_first")(x)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dropout(0.4)(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)  # n_classes is defined elsewhere
    return keras.Model(inputs, outputs)
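For completeness, this is roughly how I compile and train the model afterwards (a sketch: x_train, y_train, n_classes and the hyperparameters here are placeholders for my setup, not the values from the example):

n_classes = 3  # placeholder

model = build_model(input_shape=x_train.shape[1:], num_transformer_blocks=4)
model.compile(
    loss="sparse_categorical_crossentropy",
    optimizer=keras.optimizers.Adam(learning_rate=1e-4),
    metrics=["sparse_categorical_accuracy"],
)
model.fit(x_train, y_train, validation_split=0.2, epochs=100, batch_size=64)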
I am using a different dataset, and the shape of my data is different too. My sequences also do not have a fixed length, so I am trying to add masking to the model so that it ignores the padded (missing) time steps.
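This is roughly how the padded inputs are produced (a sketch; the 0.0 padding value, post-padding, and the maximum length of 15 are assumptions I use here for illustration):

import numpy as np
from tensorflow import keras

# Each sample has shape (timesteps_i, 7) with a variable number of time steps.
raw_samples = [np.random.rand(np.random.randint(3, 16), 7) for _ in range(32)]

# Pad every sample to 15 time steps with 0.0 so they stack into one array.
x_train = keras.preprocessing.sequence.pad_sequences(
    raw_samples, maxlen=15, dtype="float32", padding="post", value=0.0
)
print(x_train.shape)  # (32, 15, 7) -> each sample matches input_shape = (15, 7)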
So far I have tried a few options, but none of them have worked. First, I tried to add a Masking layer right after the input layer:
def build_model(input_shape, num_transformer_blocks):
    inputs = keras.Input(shape=input_shape)
    x = layers.Masking()(inputs)
    ...
I also tried to compute the mask of the data manually and pass it as the attention_mask argument of the MultiHeadAttention layer.
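That attempt looked roughly like this (a sketch; the 0.0 padding value and the way the mask is broadcast to (batch, 15, 15) are my assumptions, written against tf.keras 2.x where TensorFlow ops on Keras tensors are wrapped automatically):

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers


def transformer_encoder_with_mask(inputs):
    # True for real time steps, False where the whole step is padding (all zeros).
    step_mask = tf.reduce_any(tf.not_equal(inputs, 0.0), axis=-1)          # (batch, 15)
    # Broadcast to the (batch, query_len, key_len) shape MultiHeadAttention expects.
    attn_mask = step_mask[:, tf.newaxis, :] & step_mask[:, :, tf.newaxis]  # (batch, 15, 15)

    x = layers.LayerNormalization(epsilon=1e-6)(inputs)
    x = layers.MultiHeadAttention(key_dim=256, num_heads=4, dropout=0.25)(
        x, x, attention_mask=attn_mask
    )
    x = layers.Dropout(0.25)(x)
    res = x + inputs

    x = layers.LayerNormalization(epsilon=1e-6)(res)
    x = layers.Conv1D(filters=4, kernel_size=1, activation="relu")(x)
    x = layers.Dropout(0.25)(x)
    x = layers.Conv1D(filters=inputs.shape[-1], kernel_size=1)(x)
    return x + res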
To check whether the masking actually worked, I replaced all the real (non-padded) values in my dataset with a constant number (e.g. 500), and the model could still classify the samples into the correct classes. I suspect it simply learned the length of the padding.
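Concretely, the check looked something like this (a sketch; x_train and y_train are the padded array and labels from above, and 0.0 is assumed to be the padding value):

import numpy as np

# Overwrite every real value with a constant while keeping the 0.0 padding.
# If the padded steps were properly masked, accuracy should drop to chance level.
x_constant = np.where(x_train != 0.0, 500.0, 0.0).astype("float32")
print(model.evaluate(x_constant, y_train))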
The shape of my data, with padding, is (15, 7).
How do I correctly apply masking to this model so that the padded time steps are ignored?