Questions tagged [sampling]
186 questions
53
votes
2 answers
train_test_split() error: Found input variables with inconsistent numbers of samples
I'm fairly new to Python but am building my first RF model on some classification data. I've converted all of the labels to int64 and loaded them into X and Y as NumPy arrays, but I hit an error when I try to train the…
josh_gray
44
votes
5 answers
Intuitive explanation of Noise Contrastive Estimation (NCE) loss?
I read about NCE (a form of candidate sampling) from these two sources:
Tensorflow writeup
Original Paper
Can someone help me with the following:
A simple explanation of how NCE works (I found the above difficult to parse and get an understanding…
tejaskhot
17
votes
3 answers
With unbalanced class, do I have to use under sampling on my validation/testing datasets?
I’m a beginner in machine learning and I’m facing a situation. I’m working on a Real Time Bidding problem, with the IPinYou dataset and I’m trying to do a click prediction.
The thing is that, as you may know, the dataset is very unbalanced: around…
jmvllt
17
votes
3 answers
When should we consider a dataset as imbalanced?
I'm facing a situation where the numbers of positive and negative examples in a dataset are imbalanced.
My question is, are there any rules of thumb that tell us when we should subsample the large category in order to force some kind of balancing in…
Rami
16
votes
1 answer
How many features to sample using Random Forests
The Wikipedia page which quotes "The Elements of Statistical Learning" says:
Typically, for a classification problem with $p$ features, $\lfloor \sqrt{p}\rfloor$ features are used in each split.
I understand that this is a fairly good educated…
Valentin Calomme
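As a small illustration of the rule of thumb quoted in the excerpt above, the sketch below computes $\lfloor \sqrt{p} \rfloor$ for a few feature counts. The helper name `features_per_split` is hypothetical and is not part of any library; it only restates the formula.

```python
import math


def features_per_split(p: int) -> int:
    """Return floor(sqrt(p)), the quoted rule-of-thumb feature count per split."""
    if p < 1:
        raise ValueError("p must be a positive integer")
    # math.isqrt returns the integer floor of the square root.
    return math.isqrt(p)


for p in (4, 10, 100, 1000):
    print(p, features_per_split(p))
```

Whether this default is a good choice for a given dataset is the open question the post asks; the snippet only shows what the formula evaluates to.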
15
votes
2 answers
Why do we need to handle data imbalance?
I would like to know why we need to deal with data imbalance. I know the different methods for dealing with it: upsampling, downsampling, or SMOTE.
For example, if I have a rare disease 1 percent out of 100, and…
sara
15
votes
1 answer
Is stratified sampling necessary (random forest, Python)?
I use Python to run a random forest model on my imbalanced dataset (the target variable is a binary class). When splitting the data into training and testing sets, I struggled with whether to use stratified sampling (like the code shown) or not. So far, I…
LUSAQX
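The asker's code is not shown in the excerpt, so the following is only a minimal sketch of what a stratified split does, written in plain Python. The function `stratified_split` and its parameters are hypothetical names chosen for illustration; in practice, scikit-learn's `train_test_split` accepts a `stratify` argument for the same purpose.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Split row indices so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within each class, then cut off the test share.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy imbalanced labels: 90 negatives, 10 positives.
labels = [0] * 90 + [1] * 10
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
print(len(test_idx), sum(labels[i] for i in test_idx))
```

With a purely random split, a small test set could end up with very few or no positives; splitting per class keeps the 10% positive rate in both parts.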
13
votes
2 answers
Cross-validation: K-fold vs Repeated random sub-sampling
I wonder which type of model cross-validation to choose for a classification problem: K-fold or random sub-sampling (bootstrap sampling)?
My best guess is to use 2/3 of the data set (which is ~1000 items) for training and 1/3 for validation.
In this…
IgorS
12
votes
1 answer
Cross validation for highly imbalanced data with undersampling
In my problem, I am dealing with a highly imbalanced data set: say, for every positive example there are 10,000 negative ones. A normal starting method to train a model is to undersample the data. In this procedure, it is very important to train our…
Amin Kiany
10
votes
2 answers
How are samples selected from training data in Xgboost
In Random Forest, each tree is not fed with the full batch of training data, only a sample.
How does this work for Xgboost? If this sampling happens as well, how does it work for this ML algorithm?
Aman Raparia
8
votes
1 answer
Why gradient boosting uses sampling without replacement?
In Random Forest, each tree is built by selecting a sample with replacement (bootstrap), and I assumed that Gradient Boosting's trees used the same sampling technique (@BenReiniger corrected me). Here are the sampling techniques…
Carlos Mougan
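To make the distinction in this question concrete, here is a minimal sketch contrasting the two sampling schemes using only Python's standard `random` module. It is not taken from any library's implementation; the variable names and the 50% subsample size are assumptions for illustration.

```python
import random

rng = random.Random(0)
rows = list(range(10))  # stand-in for training row indices

# Bootstrap (sampling with replacement): same size as the data,
# so some rows can appear more than once and others not at all.
bootstrap = rng.choices(rows, k=len(rows))

# Subsampling (without replacement): a fraction of the rows,
# each selected row appears exactly once.
subsample = rng.sample(rows, k=len(rows) // 2)

print("bootstrap:", sorted(bootstrap))
print("subsample:", sorted(subsample))
```

The bootstrap draw can repeat rows, while the subsample never does; why gradient boosting favors the latter is what the question asks.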
7
votes
3 answers
Why did sampling boost the performance of my model?
I have an imbalanced dataset with 88 positive samples and 128575 negative samples. I was reluctant to over/undersample the data since it's a biological dataset and I didn't want to introduce synthetic data. I built a Random Forest Classifier with…
Senthamizhan
6
votes
2 answers
Is sampling a valid way to reduce complexity?
I'm facing an issue where I have a massive amount of data that I need to cluster. As we know, clustering algorithms can have very high time complexity, and I'm looking for ways to reduce my algorithm's running time.
I want to try a few different…
lte__
6
votes
1 answer
How to define a custom resampling methodology
I'm using an experimental design to test the robustness of different classification methods, and now I'm searching for the correct definition of such a design.
I'm creating different subsets of the full dataset by cutting away some samples. Each…
gc5
6
votes
1 answer
Decision trees, categorization and oversampling
I want to create a model to predict the propensity to buy a certain product. As my proportion of 1's is very low, I decided to apply oversampling (to get 10% 1's and 90% 0's).
Now, I want to discretize some of the variables. To do so I run…
Elisa