Questions tagged [preprocessing]

Data preprocessing is a data mining technique that transforms raw data into a more understandable and useful format.

Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues. Data preprocessing prepares raw data for further processing or predictive modeling.

525 questions
56 votes, 4 answers

Difference between OrdinalEncoder and LabelEncoder

I was going through the official scikit-learn documentation after reading a book on ML and came across the following: the documentation describes sklearn.preprocessing.OrdinalEncoder(), whereas in the book it was given…
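For readers who land on this question: the short version of the distinction is that `LabelEncoder` is intended for a 1-D target array while `OrdinalEncoder` is intended for a 2-D feature matrix. A minimal sketch with toy data:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# LabelEncoder is meant for a 1-D array of target labels (y)
le = LabelEncoder()
y = le.fit_transform(["cat", "dog", "cat", "bird"])   # classes sorted: bird, cat, dog

# OrdinalEncoder is meant for a 2-D feature matrix (X), one column per feature
oe = OrdinalEncoder()
X = oe.fit_transform(np.array([["cat"], ["dog"], ["cat"], ["bird"]]))
```

Both sort categories alphabetically by default; the practical difference is the expected input shape and intended use (targets vs. features).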
48 votes, 3 answers

StandardScaler before or after splitting data - which is better?

When I was reading about using StandardScaler, most of the recommendations were saying that you should use StandardScaler before splitting the data into train/test, but when I was checking some of the code posted online (using sklearn) there were…
tsumaranaina • 695
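The pattern the answers to this question generally converge on is to split first and fit the scaler on the training split only, so no test-set statistics leak into training. A minimal sketch with a toy feature matrix:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(10, 2)   # toy feature matrix
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)   # statistics computed on training data only
X_test_s = scaler.transform(X_test)         # same statistics reused: no test leakage
```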
41 votes, 2 answers

How to prepare/augment images for neural network?

I would like to use a neural network for image classification. I'll start with pre-trained CaffeNet and train it for my application. How should I prepare the input images? In this case, all the images are of the same object but with variations…
25 votes, 4 answers

Different Test Set and Training Set Distribution

I am working on a data science competition in which the distribution of my test set is different from that of the training set. I want to subsample observations from the training set that closely resemble the test set. How can I do this?
Pooja • 251
22 votes, 2 answers

Loading own train data and labels in dataloader using pytorch?

I have x_data and labels separately. How can I combine and load them into the model using torch.utils.data.DataLoader? I have a dataset that I created; the training data has 20k samples and the labels are also separate. Let's say I want to load a…
Amarnath • 351
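The standard answer here is `TensorDataset`, which pairs sample i with label i so a `DataLoader` can batch them together. A sketch with small stand-in tensors (shapes are illustrative, not the asker's 20k samples):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

x_data = torch.randn(100, 8)            # stand-in for the feature samples
labels = torch.randint(0, 2, (100,))    # stand-in for the separate label array

dataset = TensorDataset(x_data, labels)             # zips sample i with label i
loader = DataLoader(dataset, batch_size=16, shuffle=True)

xb, yb = next(iter(loader))             # one mini-batch of (inputs, labels)
```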
21 votes, 3 answers

Image resizing and padding for CNN

I want to train a CNN for image recognition. The training images do not have a fixed size. I want the input size for the CNN to be 50x100 (height x width), for example. When I resize some small images (for example 32x32) to the input size, the content…
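One approach the answers to this question discuss is to resize while preserving the aspect ratio and pad the remainder with a constant. A NumPy-only sketch (nearest-neighbour indexing stands in for a real image-resizing library; function name and defaults are illustrative):

```python
import numpy as np

def resize_and_pad(img, target_h=50, target_w=100, pad_value=0):
    """Scale a 2-D image to fit inside the target box, then centre-pad."""
    h, w = img.shape
    scale = min(target_h / h, target_w / w)
    new_h = max(1, int(h * scale))
    new_w = max(1, int(w * scale))
    # nearest-neighbour resampling via index arrays
    rows = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    resized = img[np.ix_(rows, cols)]
    # paste the resized image into a constant-valued canvas
    out = np.full((target_h, target_w), pad_value, dtype=img.dtype)
    top = (target_h - new_h) // 2
    left = (target_w - new_w) // 2
    out[top:top + new_h, left:left + new_w] = resized
    return out
```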
16 votes, 2 answers

One Hot Encoding vs Word Embedding - When to choose one or another?

A colleague of mine has an interesting situation: he has quite a large set of possible values for a given categorical feature (roughly 300 distinct values). The usual data science approach would be to perform a one-hot encoding. However, wouldn't…
14 votes, 2 answers

Preprocessing for Text Classification in Transformer Models (BERT variants)

This might be a silly question, but I am wondering whether one should carry out the conventional text preprocessing steps when training one of the transformer models? I remember that for training Word2Vec or GloVe, we needed to perform extensive text cleaning…
TwinPenguins • 4,157
12 votes, 5 answers

Please review my sketch of the Machine Learning process

It's amazingly difficult to find an outline of the end-to-end machine learning process. As a total beginner, this lack of information is frustrating, so I decided to try scraping together my own process by looking at a lot of tutorials that all do…
11 votes, 1 answer

Data preprocessing: Should we normalise images pixel-wise?

Let me present you with a toy example and a reasoning on image normalisation I had: Suppose we have a CNN architecture to classify NxN grayscale images in two categories. Pixel values range from 0 (black) to 255 (white). Class 0: Images that…
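The two schemes the question contrasts are easy to state side by side: global normalisation uses one mean/std for the whole dataset, while pixel-wise normalisation computes a separate mean/std at each pixel position. A sketch on a toy grayscale batch (shapes and the random data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
imgs = rng.integers(0, 256, size=(64, 8, 8)).astype(float)   # toy grayscale batch

# global normalisation: one mean/std for the entire dataset
global_norm = (imgs - imgs.mean()) / imgs.std()

# pixel-wise normalisation: a separate mean/std per pixel position
pixel_norm = (imgs - imgs.mean(axis=0)) / imgs.std(axis=0)
```

After pixel-wise normalisation every pixel position has zero mean across the batch, which is exactly what the global scheme does not guarantee.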
9 votes, 2 answers

Effect of Stop-Word Removal on Transformers for Text Classification

The domain here is essentially topic classification, so not necessarily a problem where stop-words have an impact on the analysis (as opposed to, say, sentiment analysis where structure can affect meaning). With respect to the positional encoding…
9 votes, 1 answer

How to approach the numer.ai competition with anonymous scaled numerical predictors?

Numer.ai has been around for a while now, and there seem to be only a few posts or other discussions about it on the web. The system has changed from time to time, and the set-up today is the following: train (N=96K) and test (N=33K) data with 21…
8 votes, 1 answer

Encoding with OrdinalEncoder : how to give levels as user input?

I am trying to do ordinal encoding using: from sklearn.preprocessing import OrdinalEncoder I will try to explain my problem with a simple dataset. X = pd.DataFrame({'animals':['low','med','low','high','low','high']}) enc =…
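The answer to this one is the `categories` parameter, which takes one ordered list per feature column. A sketch completing the question's own toy dataset:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

X = pd.DataFrame({'animals': ['low', 'med', 'low', 'high', 'low', 'high']})

# pass the desired level order explicitly: one list per feature column
enc = OrdinalEncoder(categories=[['low', 'med', 'high']])
X_enc = enc.fit_transform(X)   # low -> 0.0, med -> 1.0, high -> 2.0
```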
8 votes, 1 answer

sklearn SimpleImputer too slow for categorical data represented as string values

I have a data set with categorical features represented as string values, and I want to fill in the missing values. I've tried sklearn's SimpleImputer, but it takes much more time to complete the task than pandas. Both methods produce…
vlc146543 • 83
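The pandas route the question benchmarks against can be as short as filling each column with its mode (most frequent value). A sketch on a toy frame (column names and data are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', None, 'red', 'blue'],
                   'size':  ['S', 'M', None, 'M']})

# fill each categorical column with its most frequent value,
# equivalent to SimpleImputer(strategy='most_frequent')
filled = df.fillna({col: df[col].mode()[0] for col in df.columns})
```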
8 votes, 5 answers

Issue with OneHotEncoding

So I have a pandas DataFrame with categorical variables in a column that I want to one-hot encode. I've used the following code from an ML Udemy course: from sklearn.preprocessing import…
Iltl • 253
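Worth noting for anyone hitting this: the pattern taught in older course material (the `categorical_features` argument on `OneHotEncoder`) was removed from scikit-learn; the current idiom selects columns with `ColumnTransformer`. A sketch on a toy frame (column names are illustrative):

```python
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'city': ['NY', 'LA', 'NY'], 'temp': [20.0, 25.0, 22.0]})

ct = ColumnTransformer(
    [('onehot', OneHotEncoder(), ['city'])],   # encode only the chosen column
    remainder='passthrough'                    # keep the numeric column unchanged
)
X = ct.fit_transform(df)
X = X.toarray() if sparse.issparse(X) else X   # output may be sparse by default
```

The encoded columns come first (categories sorted: LA, NY), followed by the passed-through numeric column.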