Questions tagged [sampling]

186 questions
53
votes
2 answers

train_test_split() error: Found input variables with inconsistent numbers of samples

Fairly new to Python but building out my first RF model based on some classification data. I've converted all of the labels into int64 numerical data and loaded into X and Y as a numpy array, but I am hitting an error when I am trying to train the…
josh_gray
  • 633
  • 1
  • 5
  • 4
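
A note on the error in the title: scikit-learn reports it when the arrays passed to `train_test_split()` do not all have the same number of rows. The sketch below reproduces that check in plain Python. The helpers `sample_counts` and `have_consistent_lengths` and the toy data are hypothetical illustrations, not scikit-learn's own code.

```python
def sample_counts(*arrays):
    """Return the number of samples (first-dimension length) of each input."""
    return [len(a) for a in arrays]


def have_consistent_lengths(*arrays):
    """Return True when every input has the same number of samples."""
    return len(set(sample_counts(*arrays))) <= 1


# Hypothetical data: 4 feature rows but only 3 labels.
X = [[0, 1], [1, 0], [1, 1], [0, 0]]
y = [0, 1, 1]

print(sample_counts(X, y))            # [4, 3]
print(have_consistent_lengths(X, y))  # False
```

Printing the lengths (or `.shape` for numpy arrays) of X and Y before splitting is usually enough to see which input is off.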
44
votes
5 answers

Intuitive explanation of Noise Contrastive Estimation (NCE) loss?

I read about NCE (a form of candidate sampling) from these two sources: the TensorFlow writeup and the original paper. Can someone help me with the following: A simple explanation of how NCE works (I found the above difficult to parse and get an understanding…
tejaskhot
  • 3,935
  • 7
  • 20
  • 18
17
votes
3 answers

With unbalanced class, do I have to use under sampling on my validation/testing datasets?

I’m a beginner in machine learning and I’m facing a situation. I’m working on a Real Time Bidding problem with the IPinYou dataset, and I’m trying to do click prediction. The thing is that, as you may know, the dataset is very unbalanced: Around…
jmvllt
  • 619
  • 1
  • 8
  • 15
17
votes
3 answers

When should we consider a dataset as imbalanced?

I'm facing a situation where the numbers of positive and negative examples in a dataset are imbalanced. My question is, are there any rules of thumb that tell us when we should subsample the large category in order to force some kind of balancing in…
Rami
  • 594
  • 1
  • 5
  • 16
16
votes
1 answer

How many features to sample using Random Forests

The Wikipedia page which quotes "The Elements of Statistical Learning" says: Typically, for a classification problem with $p$ features, $\lfloor \sqrt{p}\rfloor$ features are used in each split. I understand that this is a fairly good educated…
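
The rule quoted in this excerpt can be worked through numerically. The sketch below computes $\lfloor \sqrt{p}\rfloor$ and draws that many candidate features for one split. The function names are hypothetical, and the uniform draw without replacement is an assumption about how candidates are picked, not a quote from the cited book.

```python
import math
import random


def features_per_split(p):
    """Number of candidate features per split under the floor(sqrt(p)) rule."""
    return math.isqrt(p)


def sample_split_features(p, rng):
    """Pick floor(sqrt(p)) distinct feature indices out of p for one split."""
    return rng.sample(range(p), k=features_per_split(p))


print(features_per_split(10))   # 3
print(features_per_split(16))   # 4
print(features_per_split(100))  # 10

rng = random.Random(0)
print(sample_split_features(16, rng))  # 4 distinct indices in range(16)
```

In scikit-learn, the closest setting is `max_features="sqrt"` on `RandomForestClassifier`.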
15
votes
2 answers

Why do we need to handle data imbalance?

I would like to know why we need to deal with data imbalance. I know how to deal with it and different methods to solve the issue - by upsampling or downsampling or by using SMOTE. For example, if I have a rare disease with a prevalence of 1 percent, and…
sara
  • 481
  • 7
  • 15
15
votes
1 answer

Is stratified sampling necessary (random forest, Python)?

I use Python to run a random forest model on my imbalanced dataset (the target variable is a binary class). When splitting the training and testing dataset, I struggled with whether to use stratified sampling (like the code shown) or not. So far, I…
LUSAQX
  • 783
  • 2
  • 10
  • 24
13
votes
2 answers

Cross-validation: K-fold vs Repeated random sub-sampling

I wonder which type of model cross-validation to choose for a classification problem: K-fold or random sub-sampling (bootstrap sampling)? My best guess is to use 2/3 of the data set (which is ~1000 items) for training and 1/3 for validation. In this…
IgorS
  • 5,444
  • 11
  • 31
  • 43
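
The two schemes named in this question differ in how test sets are formed, which a short sketch can make concrete. Both generators below are hypothetical helpers written in plain Python, not a library API. K-fold places each item in exactly one test fold, while repeated random sub-sampling draws an independent split each time, so test sets can overlap across repeats. Bootstrap sampling, which draws with replacement, is a different scheme and is not shown.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists; each index is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def random_subsampling_indices(n, test_fraction, repeats, seed=0):
    """Yield (train, test) index lists from independent random splits."""
    rng = random.Random(seed)
    n_test = round(n * test_fraction)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        yield idx[n_test:], idx[:n_test]


# Roughly the setup in the question: ~1000 items, 1/3 held out per split.
for train, test in random_subsampling_indices(1000, 1 / 3, repeats=3):
    print(len(train), len(test))  # 667 333
```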
12
votes
1 answer

Cross validation for highly imbalanced data with undersampling

In my problem, I am dealing with a highly imbalanced data set, say for every positive example there are 10000 negative ones. A normal starting method to train a model is to undersample the data. In this procedure, it is very important to train our…
10
votes
2 answers

How are samples selected from training data in XGBoost?

In Random Forest, each tree is not fed with the full batch of training data, only a sample. How does this work for XGBoost? If this sampling happens as well, how does it work for this ML algorithm?
Aman Raparia
  • 257
  • 2
  • 8
8
votes
1 answer

Why does gradient boosting use sampling without replacement?

In Random Forest each tree is built selecting a sample with replacement (bootstrap). And I assumed that Gradient Boosting's trees were selected with the same sampling technique. (@BenReiniger corrected me). Here are the sampling techniques…
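
The contrast raised in this excerpt, drawing rows with replacement versus without, can be sketched in a few lines. The helper names and the 0.5 fraction below are hypothetical choices for illustration. This is not the internal code of any boosting library.

```python
import random


def bootstrap_sample(n, rng):
    """Draw n row indices with replacement; duplicates are expected."""
    return rng.choices(range(n), k=n)


def subsample(n, fraction, rng):
    """Draw a fraction of the row indices without replacement; no duplicates."""
    return rng.sample(range(n), k=int(n * fraction))


rng = random.Random(0)
boot = bootstrap_sample(100, rng)
sub = subsample(100, 0.5, rng)

print(len(boot), len(set(boot)))  # 100 draws, typically fewer unique rows
print(len(sub), len(set(sub)))    # 50 50
```

In XGBoost, the fraction of rows drawn for each boosting round is controlled by the `subsample` parameter.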
7
votes
3 answers

Why did sampling boost the performance of my model?

I have an imbalanced dataset with 88 positive samples and 128575 negative samples. I was reluctant to over/undersample the data since it's a biological dataset and I didn't want to introduce synthetic data. I built a Random Forest Classifier with…
6
votes
2 answers

Is sampling a valid way to reduce complexity?

I'm facing an issue where I have a massive amount of data that I need to cluster. As we know, clustering algorithms can have a very high O complexity, and I'm looking for ways to reduce the time my algorithm is running. I want to try a few different…
lte__
  • 1,310
  • 5
  • 18
  • 26
6
votes
1 answer

How to define a custom resampling methodology

I'm using an experimental design to test the robustness of different classification methods, and now I'm searching for the correct definition of such design. I'm creating different subsets of the full dataset by cutting away some samples. Each…
gc5
  • 879
  • 2
  • 9
  • 17
6
votes
1 answer

Decision trees, categorization and oversampling

I want to create a model to predict the propensity to buy a certain product. As my proportion of 1's is very low, I decided to apply oversampling (to get 10% 1's and 90% 0's). Now, I want to discretize some of the variables. To do so I run…
Elisa
  • 81
  • 1