Highest Voted 'sampling' Questions - Data Science Stack Exchange

53

votes

2 answers

train_test_split() error: Found input variables with inconsistent numbers of samples

Fairly new to Python but building out my first RF model based on some classification data. I've converted all of the labels into int64 numerical data and loaded into X and Y as a numpy array, but I am hitting an error when I am trying to train the…

python scikit-learn sampling

asked Jul 06 '17 at 05:17

josh_gray
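
The error in the title generally means that the feature array and the label array passed to the split do not have the same number of rows. A minimal sketch of that check follows, in plain Python rather than scikit-learn's own code; `check_same_length` and the toy data are hypothetical names for illustration.

```python
def check_same_length(X, y):
    """Raise ValueError if X and y do not hold the same number of samples."""
    n_X, n_y = len(X), len(y)
    if n_X != n_y:
        raise ValueError(
            f"Inconsistent numbers of samples: X has {n_X}, y has {n_y}"
        )
    return n_X


# Hypothetical data: 4 feature rows but only 3 labels.
X = [[0, 1], [1, 0], [1, 1], [0, 0]]
y = [0, 1, 1]

try:
    check_same_length(X, y)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

Comparing the lengths (or `.shape[0]` for NumPy arrays) of both inputs before splitting is usually the quickest way to locate the mismatch.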

44

votes

5 answers

Intuitive explanation of Noise Contrastive Estimation (NCE) loss?

I read about NCE (a form of candidate sampling) from these two sources: the Tensorflow writeup and the original paper. Can someone help me with the following: A simple explanation of how NCE works (I found the above difficult to parse and get an understanding…

deep-learning tensorflow word-embeddings sampling loss-function

asked Aug 05 '16 at 03:36

tejaskhot

17

votes

3 answers

With unbalanced classes, do I have to use undersampling on my validation/testing datasets?

I’m a beginner in machine learning and I’m facing a problem. I’m working on a Real Time Bidding task with the IPinYou dataset, and I’m trying to do click prediction. The thing is that, as you may know, the dataset is very unbalanced: Around…

machine-learning dataset sampling

asked Nov 18 '15 at 20:14

jmvllt

17

votes

3 answers

When should we consider a dataset as imbalanced?

I'm facing a situation where the numbers of positive and negative examples in a dataset are imbalanced. My question is, are there any rules of thumb that tell us when we should subsample the large category in order to force some kind of balancing in…

classification dataset sampling class-imbalance

asked May 16 '16 at 11:36

Rami

16

votes

1 answer

How many features to sample using Random Forests

The Wikipedia page which quotes "The Elements of Statistical Learning" says: Typically, for a classification problem with $p$ features, $\lfloor \sqrt{p}\rfloor$ features are used in each split. I understand that this is a fairly good educated…

statistics random-forest optimization model-evaluations sampling

asked Oct 10 '17 at 10:50

Valentin Calomme
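
To make the $\lfloor \sqrt{p}\rfloor$ rule quoted in the excerpt concrete, here is a small sketch that computes the number of candidate features and draws that many feature indices for one split. The function names and the choice of `p = 20` are assumptions for illustration, not any library's implementation.

```python
import math
import random


def features_per_split(p):
    """Number of candidate features per split: floor(sqrt(p))."""
    return math.isqrt(p)


def sample_split_features(p, rng):
    """Draw floor(sqrt(p)) distinct feature indices (at least one)."""
    k = max(1, features_per_split(p))
    return sorted(rng.sample(range(p), k))


rng = random.Random(0)
candidates = sample_split_features(20, rng)  # 4 of the 20 feature indices
```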

15

votes

2 answers

Why do we need to handle data imbalance?

I would like to know why we need to deal with data imbalance. I know how to deal with it and different methods to solve the issue - by up sampling or down sampling or by using SMOTE. For example, if I have a rare disease 1 percent out of 100, and…

classification dataset sampling class-imbalance

asked Nov 06 '17 at 06:15

sara

15

votes

1 answer

Is stratified sampling necessary (random forest, Python)?

I use Python to run a random forest model on my imbalanced dataset (the target variable is a binary class). When splitting the data into training and testing sets, I struggled with whether to use stratified sampling (like the code shown) or not. So far, I…

machine-learning python random-forest sampling training

asked Jan 12 '17 at 00:58

LUSAQX
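
For readers unfamiliar with the term, a stratified split samples each class separately so that the train and test sets keep roughly the original class ratio. The sketch below uses only the standard library; `stratified_split` and the 90/10 toy labels are hypothetical, and this is not the asker's code.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction, seed=0):
    """Split indices so each class contributes test_fraction of its rows to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical imbalanced labels: 90 negatives, 10 positives.
labels = [0] * 90 + [1] * 10
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
```

With these toy labels the test set receives 2 of the 10 positives, matching the 10% positive rate of the full data.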

13

votes

2 answers

Cross-validation: K-fold vs Repeated random sub-sampling

I wonder which type of model cross-validation to choose for a classification problem: K-fold or random sub-sampling (bootstrap sampling)? My best guess is to use 2/3 of the data set (which is ~1000 items) for training and 1/3 for validation. In this…

cross-validation sampling

asked Jun 20 '14 at 17:57

IgorS
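
The two schemes named in the title differ mainly in how validation sets are drawn: K-fold partitions the data so every item is validated exactly once, while repeated random sub-sampling draws an independent random split each time. A rough standard-library sketch, with hypothetical function names and a toy size of 12 items:

```python
import random


def k_fold_splits(n, k, seed=0):
    """Yield (train, val) index lists; each index lands in exactly one validation fold."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def repeated_random_splits(n, n_repeats, val_fraction, seed=0):
    """Yield (train, val) index lists from independent random splits; validation sets may overlap."""
    rng = random.Random(seed)
    n_val = round(n * val_fraction)
    for _ in range(n_repeats):
        indices = list(range(n))
        rng.shuffle(indices)
        yield indices[n_val:], indices[:n_val]


kfold = list(k_fold_splits(12, 3))
repeated = list(repeated_random_splits(12, 3, 1 / 3))
```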

12

votes

1 answer

Cross validation for highly imbalanced data with undersampling

In my problem, I am dealing with a highly imbalanced data set: say, for every positive example there are 10000 negative ones. A normal starting method to train a model is to undersample the data. In this procedure, it is very important to train our…

machine-learning scikit-learn cross-validation sampling class-imbalance

asked Feb 04 '19 at 16:32

Amin Kiany
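
One common arrangement, which may or may not be what the asker has in mind, is to undersample only the training portion of each fold and leave the validation fold at its original class ratio. The sketch below assumes binary labels (1 = positive) and uses hypothetical names and toy data throughout.

```python
import random


def undersample(indices, labels, rng):
    """Keep all positives and an equally sized random subset of negatives."""
    pos = [i for i in indices if labels[i] == 1]
    neg = [i for i in indices if labels[i] == 0]
    keep_neg = rng.sample(neg, min(len(neg), len(pos)))
    return pos + keep_neg


# Hypothetical data: 4 positives, 36 negatives.
labels = [1] * 4 + [0] * 36
rng = random.Random(0)
indices = list(range(len(labels)))
rng.shuffle(indices)

n_folds = 4
folds = [indices[i::n_folds] for i in range(n_folds)]

splits = []
for i, val_idx in enumerate(folds):
    train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
    # Only the training indices are rebalanced; val_idx is left untouched.
    balanced_train = undersample(train_idx, labels, rng)
    splits.append((balanced_train, val_idx))
```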

10

votes

2 answers

How are samples selected from training data in Xgboost

In Random Forest, each tree is not fed with the full batch of training data, only a sample. How does this work for Xgboost? If this sampling happens as well, how does it work for this ML algorithm?

machine-learning decision-trees xgboost sampling

asked Jan 08 '20 at 09:32

Aman Raparia

8

votes

1 answer

Why does gradient boosting use sampling without replacement?

In Random Forest, each tree is built by selecting a sample with replacement (bootstrap). I assumed that Gradient Boosting's trees were built with the same sampling technique (@BenReiniger corrected me). Here are the sampling techniques…

machine-learning random-forest decision-trees xgboost sampling

asked Feb 07 '20 at 06:59

Carlos Mougan
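
The distinction the question draws can be shown in a few lines: a bootstrap sample draws rows with replacement (duplicates allowed), while a subsample draws rows without replacement (every row distinct). This is only an illustration with toy row indices; the subsample size of 5 is an arbitrary assumption.

```python
import random

rng = random.Random(0)
rows = list(range(10))

# With replacement (bootstrap): same size as the data, rows may repeat.
bootstrap = rng.choices(rows, k=len(rows))

# Without replacement (subsample): a fraction of the rows, all distinct.
subsample = rng.sample(rows, 5)
```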

7

votes

3 answers

Why did sampling boost the performance of my model?

I have an imbalanced dataset with 88 positive samples and 128575 negative samples. I was reluctant to over/undersample the data since it's a biological dataset and I didn't want to introduce synthetic data. I built a Random Forest Classifier with…

random-forest class-imbalance sampling

asked Sep 25 '19 at 17:00

Senthamizhan

6

votes

2 answers

Is sampling a valid way to reduce complexity?

I'm facing an issue where I have a massive amount of data that I need to cluster. As we know, clustering algorithms can have a very high O complexity, and I'm looking for ways to reduce the time my algorithm is running. I want to try a few different…

clustering sampling

asked Nov 08 '20 at 17:37

lte__

6

votes

1 answer

How to define a custom resampling methodology

I'm using an experimental design to test the robustness of different classification methods, and now I'm searching for the correct definition of such design. I'm creating different subsets of the full dataset by cutting away some samples. Each…

classification definitions accuracy sampling

asked Jul 10 '14 at 11:55

gc5

6

votes

1 answer

Decision trees, categorization and oversampling

I want to create a model to predict the propensity to buy a certain product. As my proportion of 1's is very low, I decided to apply oversampling (to get 10% 1's and 90% 0's). Now, I want to discretize some of the variables. To do so I run…

classification predictive-modeling sampling

asked Dec 03 '14 at 14:23

Elisa

Questions tagged [sampling]