
I split the dataset with

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

and fit and scored the model with

from sklearn.metrics import log_loss
clf.fit(X_train, y_train)
clf_probs = clf.predict_proba(X_test)
score = log_loss(y_test, clf_probs)
print(score)

Should the final submission be fit with

clf.fit(X, y) or clf.fit(X_train, y_train)?
slowmonk
    Your final production model should use all available data since it will in general give you better model performance due to more data. Hence X, y. You should still report your model's performance based off your test set, however – aranglol Feb 11 '20 at 23:45
  • so clf.fit(X,y)? right? – slowmonk Feb 12 '20 at 00:05
    please take a look at this post https://datascience.stackexchange.com/questions/33008/is-it-always-better-to-use-the-whole-dataset-to-train-the-final-model – pcko1 Mar 13 '20 at 14:38
    Does this answer your question? [Is it always better to use the whole dataset to train the final model?](https://datascience.stackexchange.com/questions/33008/is-it-always-better-to-use-the-whole-dataset-to-train-the-final-model) – Igor F. Jul 11 '20 at 05:55

1 Answer

  1. The training set is the data the model actually learns from.
  2. The validation set is held out from training and used to check the model's performance and tune it further. Because you tune the model based on its validation performance, the model has indirectly 'seen' the validation data: repeated tuning can unknowingly bias the model toward performing well on the validation set specifically.
  3. Testing comes only after all tuning is finished (by that point your model has been influenced by, but not trained on, the validation data). Once you consider the model tuned (you can never know this for certain), you can and should retrain it on the validation data together with the training data.
  4. The test set, on the other hand, is completely unknown to the model: it is used neither for training nor for validation, so it gives a completely unbiased check of your model's performance.
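As an illustrative sketch of the three splits described above (using a synthetic dataset from `make_classification`; the 60/20/20 ratios are just an example, not a rule):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy data; in practice X, y come from your own labeled dataset
X, y = make_classification(n_samples=1000, random_state=42)

# First carve off the test set -- it stays completely unseen until the end
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Then split the remainder into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

# Result: 60% train / 20% validation / 20% test
print(len(X_train), len(X_val), len(X_test))
```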

Usually in competitions, the labeled data you are given should be used to train and tune your model. But the final predictions should be made only after the model has been trained on all of the labeled data, i.e. both your training data (X_train, y_train) and your validation data (X_test, y_test).

Hence you should submit predictions from a model that has seen the whole labeled dataset: clf.fit(X, y)
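Putting it together, a minimal sketch of the workflow (LogisticRegression on a synthetic dataset is used here as a stand-in; substitute your own classifier and data): score on the held-out split first, then refit on all labeled data for the submission.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Stand-in data and classifier
X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 1. Fit on the training split only and report the test-set score
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
score = log_loss(y_test, clf.predict_proba(X_test))
print(score)  # this is the number you report

# 2. For the final submission, refit on ALL labeled data
final_clf = LogisticRegression(max_iter=1000)
final_clf.fit(X, y)
# final_clf.predict_proba(X_unlabeled) is what you would submit
```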

I know this long explanation was not strictly necessary, but one should know why you do what you do.

Hope it helps, thanks!