
I am participating in a Kaggle multiclass classification competition in which submissions are scored on log loss. I am using Keras and scikit-learn with a deep neural network model, and I have taken the approach below.

I corrected the class imbalance in the training data by oversampling the minority classes, split the data into training (X_train, y_train) and validation (X_test, y_test) sets, scaled the features, and one-hot encoded the labels.
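
The oversampling step looked roughly like the sketch below (I am assuming imblearn's RandomOverSampler here; the exact resampler and parameters may differ):

from imblearn.over_sampling import RandomOverSampler

# Oversample the minority classes so every class ends up with the same number of rows
ros = RandomOverSampler(random_state=9)
X, y = ros.fit_resample(X, y)
print(y.value_counts())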

When I run the model, I get a good validation loss (1.708) and validation accuracy relative to the Kaggle leaderboard (the top log loss there is 1.744). But when I submit my predicted class probabilities for the test_set, I get an awfully high loss score (4+). (Separately, I got a decent score of 2.02 with a different modelling approach, which is reflected on the leaderboard.)

Why is this? Any suggestions on what I should do or where I am going wrong?

Class counts after oversampling:

Class_3    51811
Class_7    51811
Class_2    51811
Class_5    51811
Class_1    51811
Class_9    51811
Class_6    51811
Class_8    51811
Class_4    51811
Name: target, dtype: int64
466299

from sklearn.model_selection import train_test_split as tts

X_train, X_test, y_train, y_test = tts(X, y, test_size=.3, stratify=y, random_state=9)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(326409, 75)
(326409, 9)
(139890, 75)
(139890, 9)

display(X_train.head(3))
display(X_test.head(3))
display(y_train[:3])
display(y_test[:3])

    feature_0   feature_1   feature_2   feature_3   feature_4   feature_5   feature_6   feature_7   feature_8   feature_9   ...     feature_65  feature_66  feature_67  feature_68  feature_69  feature_70  feature_71  feature_72  feature_73  feature_74
425643  0   0   0   0   0   0   0   0   0   0   ...     0   0   0   0   3   0   1   0   0   0
303754  2   3   2   2   5   0   0   1   1   1   ...     1   0   0   0   0   0   0   4   6   0
80710   2   8   2   0   18  2   0   2   1   3   ...     0   0   4   1   0   3   0   0   1   0

3 rows × 75 columns
    feature_0   feature_1   feature_2   feature_3   feature_4   feature_5   feature_6   feature_7   feature_8   feature_9   ...     feature_65  feature_66  feature_67  feature_68  feature_69  feature_70  feature_71  feature_72  feature_73  feature_74
300226  0   0   1   4   0   0   0   4   1   1   ...     1   0   1   0   0   1   0   0   2   2
124793  0   0   0   6   0   0   0   3   7   2   ...     0   0   0   0   0   0   0   0   0   0
439437  0   3   0   0   5   0   0   2   1   1   ...     2   0   0   0   3   0   4   0   0   0

3 rows × 75 columns

array([[0., 0., 0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0.]], dtype=float32)

array([[0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1.]], dtype=float32)

print(X_train.index.isin(X_test.index).sum())
print(X_test.index.isin(X_train.index).sum())
0
0

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.fit_transform(X_test)
test_set = scaler.fit_transform(test_set)

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

model = Sequential()
model.add(Dense(1024, input_shape=(75,), activation='relu'))
model.add(Dense(256, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(16, activation='relu'))
model.add(Dense(9, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer=Adam(lr=.001), metrics=['accuracy'], )

from tensorflow.keras.callbacks import EarlyStopping
monitor_val_acc = EarlyStopping(monitor='val_loss', patience=5)
model.fit(X_train, y_train, epochs = 50, validation_split=.3, callbacks= [monitor_val_acc], batch_size=1024)
accuracy = model.evaluate(X_test, y_test)[1]
print('Accuracy:', accuracy)

............
Epoch 28/30
45/45 [==============================] - 5s 117ms/step - loss: 1.6676 - accuracy: 0.3626 - val_loss: 1.7675 - val_accuracy: 0.3333
Epoch 29/30
45/45 [==============================] - 5s 114ms/step - loss: 1.6140 - accuracy: 0.3809 - val_loss: 1.7815 - val_accuracy: 0.3357
Epoch 30/30
45/45 [==============================] - 5s 117ms/step - loss: 1.5942 - accuracy: 0.3869 - val_loss: 1.7126 - val_accuracy: 0.3563
4372/4372 [==============================] - 11s 2ms/step - loss: 1.7085 - accuracy: 0.3582
Accuracy: 0.3581957221031189

from sklearn.metrics import accuracy_score
from sklearn.metrics import log_loss
preds_val = model.predict(X_test)

preds_val[:3]
array([[1.13723904e-01, 5.20741269e-02, 4.70720865e-02, 1.59640312e-02,
        1.92086305e-02, 2.25828230e-01, 1.81854114e-01, 1.99746847e-01,
        1.44528091e-01],
       [6.04994688e-03, 1.40825182e-01, 9.95656699e-02, 5.96038415e-04,
        5.59030111e-09, 4.57442701e-02, 3.05081338e-01, 1.77178025e-01,
        2.24959582e-01],
       [6.54266328e-02, 9.87399742e-02, 1.07230745e-01, 1.46904245e-01,
        6.80148089e-03, 1.52257413e-01, 1.22348621e-01, 1.58026025e-01,
        1.42264828e-01]], dtype=float32)

log_loss(y_test, preds_val)
1.708450169537806
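
For the submission itself, the file is built roughly as sketched below (the 'id' values and the Class_1..Class_9 column names are assumed from the competition's sample submission; test_ids is a hypothetical variable holding the id column of the raw test file):

import pandas as pd

# Predict class probabilities for the Kaggle test set (already scaled above)
preds_test = model.predict(test_set)
class_cols = ['Class_' + str(i) for i in range(1, 10)]
submission = pd.DataFrame(preds_test, columns=class_cols)
submission.insert(0, 'id', test_ids)   # test_ids: id column from the raw test file (assumed)
submission.to_csv('submission.csv', index=False)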
  • Maybe your training/validation split is made in a way that is leaking. You should try to split in a way analogous to how the gold test data was split by the competition organizers. For instance, in speech tasks you may not want to mix the same speakers across splits. – noe Jun 14 '21 at 11:13
  • @noe, I don't think so; the commands shown in the question prove it, right? print(X_train.index.isin(X_test.index).sum()) and print(X_test.index.isin(X_train.index).sum()) both return zero. – Srinivas Jun 14 '21 at 11:17
  • Leaking is more than having the exact same data in both sets; that is what I was trying to illustrate with the speaker-split example. You should ask whether, for your specific data domain, the split should be performed based on some feature value. – noe Jun 14 '21 at 12:04
  • @noe, apologies, I think I am missing your point. Can you please provide a link with details on the speaker split so that I can better understand? Then I can probably see where the problem is. Thank you. – Srinivas Jun 14 '21 at 12:19
  • I suggest you take a look at the [Wikipedia page of leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)), which contains sensible explanations and examples. – noe Jun 14 '21 at 12:31
  • It looks to me like you're resampling the whole dataset before splitting, am I right? If so, that can certainly explain the problem: resampling should be applied only to the training set (see for instance [this question](https://datascience.stackexchange.com/q/60764/64377)). – Erwan Jun 14 '21 at 17:56
  • Erwan, you are right; I realised that after @noe's suggestion. I think that explains why I get a good validation score but not on the test set (a sketch of the corrected ordering is below). Thank you. – Srinivas Jun 15 '21 at 12:38
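
In other words, the fix suggested in the comments is to split first and oversample only the training portion. A minimal sketch, again assuming imblearn's RandomOverSampler (the actual resampler may differ):

from sklearn.model_selection import train_test_split as tts
from imblearn.over_sampling import RandomOverSampler

# Split the original, still-imbalanced data first, so the validation set keeps
# the same class distribution as the hidden Kaggle test set.
X_train, X_test, y_train, y_test = tts(X, y, test_size=.3, stratify=y, random_state=9)

# Oversample only the training portion; the validation set is left untouched.
ros = RandomOverSampler(random_state=9)
X_train, y_train = ros.fit_resample(X_train, y_train)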
