Questions tagged [cross-validation]

Refers to general procedures that attempt to determine the generalizability of a statistical result. Cross-validation arises frequently in the context of assessing how well a particular model fit predicts future observations. Methods for cross-validation usually involve withholding a random subset of the data during model fitting, quantifying how accurately the withheld data are predicted, and repeating this process to obtain a measure of prediction accuracy.
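As a minimal sketch of such a procedure, assuming scikit-learn (the dataset and estimator below are placeholders, not a recommendation):

```python
# Hedged sketch: 5-fold cross-validation with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Each observation is held out exactly once across the 5 folds.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())  # prediction accuracy across held-out folds
```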

639 questions
191
votes
16 answers

Train/Test/Validation Set Splitting in Sklearn

How could I randomly split a data matrix and the corresponding label vector into X_train, X_test, X_val, y_train, y_test, y_val with scikit-learn? As far as I know, sklearn.model_selection.train_test_split is only capable of splitting into two, not…
Hendrik
  • 8,377
  • 17
  • 40
  • 55
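A commonly suggested pattern for this question (a sketch only, not necessarily the accepted answer) is to call train_test_split twice; the 60/20/20 ratio and the toy arrays below are placeholders:

```python
# Hedged sketch: 60/20/20 train/validation/test split via two train_test_split calls.
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(200).reshape(100, 2), np.arange(100)  # placeholder data

# First split off the test set (20% of all data).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Then carve the validation set out of the remaining 80% (0.25 * 0.8 = 0.20 overall).
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)
```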
52
votes
3 answers

What is the difference between bootstrapping and cross-validation?

I used to apply K-fold cross-validation for robust evaluation of my machine learning models. But I'm aware of the existence of the bootstrapping method for this purpose as well. However, I cannot see the main difference between them in terms of…
Fredrik
  • 967
  • 2
  • 9
  • 11
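For context, a rough sketch of the mechanical difference (estimator and data are placeholders): k-fold CV holds each observation out exactly once, while bootstrapping resamples with replacement and scores on the rows left out of each resample.

```python
# Hedged sketch: k-fold CV vs. a simple out-of-bag bootstrap estimate.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
model = LogisticRegression(max_iter=1000)

# k-fold: disjoint held-out folds covering the whole dataset.
cv_scores = cross_val_score(model, X, y, cv=5)

# Bootstrap: sample n rows with replacement, score on the left-out rows.
rng = np.random.default_rng(0)
boot_scores = []
for _ in range(50):
    idx = rng.integers(0, len(X), size=len(X))
    oob = np.setdiff1d(np.arange(len(X)), idx)
    boot_scores.append(model.fit(X[idx], y[idx]).score(X[oob], y[oob]))

print(cv_scores.mean(), np.mean(boot_scores))
```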
46
votes
2 answers

How does the validation_split parameter of Keras' fit function work?

The validation_split argument of the Keras Sequential model's fit function is documented as follows at https://keras.io/models/sequential/: validation_split: Float between 0 and 1. Fraction of the training data to be used as validation data. The model will set…
rnso
  • 1,558
  • 3
  • 16
  • 34
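A minimal sketch of what that documented behaviour looks like in practice (the model and data are placeholders); per the Keras docs, validation_split takes the last fraction of the arrays before any shuffling rather than sampling randomly:

```python
# Hedged sketch: validation_split in Keras' fit(). With validation_split=0.2,
# the LAST 20% of the arrays (taken before shuffling) is held out and reported
# as val_loss / val_accuracy each epoch.
import numpy as np
from tensorflow import keras

X = np.random.rand(1000, 20)                  # placeholder features
y = np.random.randint(0, 2, size=1000)        # placeholder binary labels

model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(20,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

history = model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2)
print(history.history["val_accuracy"])
```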
38
votes
2 answers

Why use both validation set and test set?

Consider a neural network: for a given set of data, we divide it into training, validation, and test sets. Suppose we do it in the classic 60:20:20 ratio; we then prevent overfitting by validating the network on the validation set. Then…
user1825567
  • 1,336
  • 1
  • 12
  • 22
37
votes
2 answers

How to use the output of GridSearch?

I'm currently working with Python and scikit-learn for classification purposes, and after doing some reading around GridSearch I thought this was a great way of optimising my estimator parameters to get the best results. My methodology is this: split my…
Dan Carter
  • 1,732
  • 1
  • 11
  • 26
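The pattern usually suggested for this (a sketch, not necessarily the accepted answer): let GridSearchCV refit the best configuration on the training data, then read best_params_ / best_score_ and evaluate best_estimator_ once on a held-out test set. The estimator and grid below are placeholders.

```python
# Hedged sketch: consuming GridSearchCV's output.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

grid = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", "auto"]},
    cv=5,
    refit=True,          # refit the best setting on all of X_train
)
grid.fit(X_train, y_train)

print(grid.best_params_)                             # winning hyperparameters
print(grid.best_score_)                              # mean CV score of that setting
print(grid.best_estimator_.score(X_test, y_test))    # one final test-set check
```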
35
votes
3 answers

Does modeling with Random Forests require cross-validation?

As far as I've seen, opinions tend to differ about this. Best practice would certainly dictate using cross-validation (especially if comparing RFs with other algorithms on the same dataset). On the other hand, the original source states that the…
neuron
  • 664
  • 1
  • 6
  • 9
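Assuming the "original source" refers to the out-of-bag estimate, a minimal sketch contrasting it with an explicit cross-validation estimate (data and settings are placeholders):

```python
# Hedged sketch: random forest out-of-bag score vs. 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)

rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)
print(rf.oob_score_)   # "free" validation from the bootstrap leftovers of each tree

cv_mean = cross_val_score(
    RandomForestClassifier(n_estimators=200, random_state=0), X, y, cv=5
).mean()
print(cv_mean)         # explicit cross-validated accuracy for comparison
```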
35
votes
6 answers

Merging multiple data frames row-wise in PySpark

I have 10 data frames pyspark.sql.dataframe.DataFrame, obtained from randomSplit as (td1, td2, td3, td4, td5, td6, td7, td8, td9, td10) = td.randomSplit([.1, .1, .1, .1, .1, .1, .1, .1, .1, .1], seed = 100). Now I want to join 9 td's into a single…
krishna Prasad
  • 1,147
  • 1
  • 14
  • 23
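One commonly suggested answer (a sketch under assumptions, not necessarily the accepted one) is to fold the splits together row-wise with DataFrame.union via functools.reduce; the placeholder frame below stands in for the question's td:

```python
# Hedged sketch: row-wise union of PySpark DataFrames sharing a schema.
from functools import reduce
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()
td = spark.range(1000)                      # placeholder for the question's `td`
splits = td.randomSplit([0.1] * 10, seed=100)

# Hold one split out and union the remaining nine (e.g. one cross-validation fold).
train_df = reduce(DataFrame.union, splits[:-1])
test_df = splits[-1]
print(train_df.count(), test_df.count())
```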
31
votes
2 answers

How to calculate the fold number (k-fold) in cross validation?

I am confused about how to choose the number of folds (in k-fold CV) when I apply cross-validation to check the model. Is it dependent on data size or other parameters?
Taimur Islam
  • 901
  • 4
  • 11
  • 17
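There is no single rule; k = 5 or 10 is the usual default, trading training-set size per fold against computation and the variance of the estimate. A small placeholder sketch of simply comparing the estimate at two values of k:

```python
# Hedged sketch: comparing the CV estimate for two common choices of k.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)
model = LogisticRegression(max_iter=1000)

for k in (5, 10):
    scores = cross_val_score(model, X, y, cv=k)
    print(k, scores.mean(), scores.std())  # mean and spread of the k fold scores
```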
27
votes
4 answers

Cross validation Vs. Train Validate Test

I have a question regarding the cross-validation approach and the train-validation-test approach. I was told that I can split a dataset into 3 parts: Train: we train the model. Validation: we validate and adjust model parameters. Test: never seen before…
NaveganTeX
  • 445
  • 1
  • 4
  • 9
18
votes
3 answers

What is the proper way to use early stopping with cross-validation?

I am not sure what the proper way is to use early stopping with cross-validation for a gradient boosting algorithm. For a simple train/valid split, we can use the valid dataset as the evaluation dataset for early stopping, and when refitting we…
Amine SOUIKI
  • 181
  • 1
  • 4
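One common pattern (a sketch under assumptions, not necessarily the accepted answer): stop early inside each fold, record how many boosting rounds were actually used, then refit on all the data with a fixed round count. Shown here with scikit-learn's GradientBoostingClassifier, which stops on an internal validation fraction; libraries such as XGBoost or LightGBM expose equivalent eval-set options.

```python
# Hedged sketch: per-fold early stopping, then a fixed-size refit on all data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=1000, random_state=0)

best_iters = []
for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = GradientBoostingClassifier(
        n_estimators=1000,          # upper bound on boosting rounds
        validation_fraction=0.2,    # internal split used for early stopping
        n_iter_no_change=10,        # stop when the validation score stalls
        random_state=0,
    )
    model.fit(X[train_idx], y[train_idx])
    best_iters.append(model.n_estimators_)   # rounds actually fitted before stopping

# Refit on the full data with a fixed round count (no early stopping needed).
final_model = GradientBoostingClassifier(
    n_estimators=int(np.mean(best_iters)), random_state=0
).fit(X, y)
print(best_iters)
```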
15
votes
2 answers

Can overfitting occur even with validation loss still dropping?

I have a convolutional + LSTM model in Keras, similar to this (ref 1), that I am using for a Kaggle contest. Architecture is shown below. I have trained it on my labeled set of 11000 samples (two classes, initial prevalence is ~9:1, so I upsampled…
DeusXMachina
  • 263
  • 1
  • 2
  • 6
15
votes
3 answers

How to choose a classifier after cross-validation?

When we do k-fold cross-validation, should we just use the classifier that has the highest test accuracy? What is generally the best approach for getting a classifier from cross-validation?
Armon Safai
  • 419
  • 1
  • 5
  • 12
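The usual advice (sketched here, not necessarily the accepted answer) is that cross-validation scores the modelling procedure rather than any single fitted fold: pick the procedure with the best CV score, then refit it on all of the training data. The candidate models and data below are placeholders.

```python
# Hedged sketch: use CV to pick the procedure, then refit the winner on all data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)

candidates = {"logreg": LogisticRegression(max_iter=1000), "svm": SVC()}
cv_means = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in candidates.items()}

best_name = max(cv_means, key=cv_means.get)
final_model = candidates[best_name].fit(X, y)   # refit the winner on everything
print(best_name, cv_means)
```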
14
votes
1 answer

Stratify on regression

I have worked on classification problems, and stratified cross-validation is one of the most useful and simple techniques I've found. In that case, what it means is to build a training and validation set that have the same proportions of classes of…
David Masip
  • 5,981
  • 2
  • 23
  • 61
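One workaround often suggested (a sketch only; the binning scheme is an assumption): bin the continuous target, stratify the folds on the bins, and train on the original continuous values.

```python
# Hedged sketch: "stratified" CV for regression by binning the target into deciles.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import StratifiedKFold

X, y = make_regression(n_samples=1000, random_state=0)

# Assign each target value to one of ~10 quantile bins (the bin count is arbitrary).
edges = np.quantile(y, np.linspace(0, 1, 11)[1:-1])
y_bins = np.digitize(y, edges)

scores = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y_bins):     # stratify on the bins
    model = Ridge().fit(X[train_idx], y[train_idx])  # train on the raw target
    scores.append(model.score(X[test_idx], y[test_idx]))
print(np.mean(scores))
```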
13
votes
2 answers

Cross-validation: K-fold vs Repeated random sub-sampling

I wonder which type of cross-validation to choose for a classification problem: K-fold or random sub-sampling (bootstrap sampling)? My best guess is to use 2/3 of the data set (which is ~1000 items) for training and 1/3 for validation. In this…
IgorS
  • 5,444
  • 11
  • 31
  • 43
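For reference, both schemes exist in scikit-learn; a minimal placeholder sketch contrasting disjoint folds with repeated 2/3 : 1/3 sub-sampling:

```python
# Hedged sketch: K-fold vs. repeated random sub-sampling (ShuffleSplit).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, ShuffleSplit, cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)
model = LogisticRegression(max_iter=1000)

kfold = KFold(n_splits=3, shuffle=True, random_state=0)              # disjoint folds
shuffle = ShuffleSplit(n_splits=10, test_size=1/3, random_state=0)   # repeated 2/3 : 1/3 splits

print(cross_val_score(model, X, y, cv=kfold).mean())
print(cross_val_score(model, X, y, cv=shuffle).mean())
```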
13
votes
2 answers

Validation vs. test vs. training accuracy. Which one should I compare for claiming overfit?

I have read in several answers here and on the Internet that cross-validation helps to indicate whether or not the model will generalize well, and about overfitting. But I am confused about which two accuracies/errors among…
A.B
  • 316
  • 1
  • 3
  • 12