Questions tagged [preprocessing]

Data preprocessing is a data mining technique that transforms raw data into a more understandable and useful format.

Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues. Data preprocessing prepares raw data for further processing or predictive modeling.

525 questions
56 votes, 4 answers

Difference between OrdinalEncoder and LabelEncoder

I was going through the official scikit-learn documentation after reading a book on ML and came across the following: the documentation describes sklearn.preprocessing.OrdinalEncoder(), whereas in the book it was given…
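For readers who land on this question: the short version of the distinction is that `LabelEncoder` is intended for a 1-D target array while `OrdinalEncoder` is intended for a 2-D feature matrix. A minimal sketch with toy data:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# LabelEncoder is meant for a 1-D array of target labels (y)
le = LabelEncoder()
y = le.fit_transform(["cat", "dog", "cat", "bird"])   # classes sorted: bird, cat, dog

# OrdinalEncoder is meant for a 2-D feature matrix (X), one column per feature
oe = OrdinalEncoder()
X = oe.fit_transform(np.array([["cat"], ["dog"], ["cat"], ["bird"]]))
```

Both sort categories alphabetically by default; the practical difference is the expected input shape and intended use (targets vs. features).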
48 votes, 3 answers

StandardScaler before or after splitting data - which is better?

When I was reading about using StandardScaler, most of the recommendations were saying that you should use StandardScaler before splitting the data into train/test, but when I was checking some of the code posted online (using sklearn) there were…
tsumaranaina • 695
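The pattern the answers to this question generally converge on is to split first and fit the scaler on the training split only, so no test-set statistics leak into training. A minimal sketch with a toy feature matrix:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(10, 2)   # toy feature matrix
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)   # statistics computed on training data only
X_test_s = scaler.transform(X_test)         # same statistics reused: no test leakage
```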
41 votes, 2 answers

How to prepare/augment images for neural network?

I would like to use a neural network for image classification. I'll start with pre-trained CaffeNet and train it for my application. How should I prepare the input images? In this case, all the images are of the same object but with variations…
25 votes, 4 answers

Different Test Set and Training Set Distribution

I am working on a data science competition in which the distribution of my test set is different from that of the training set. I want to subsample observations from the training set that closely resemble the test set. How can I do this?
Pooja • 251
22 votes, 2 answers

Loading own train data and labels in dataloader using pytorch?

I have x_data and labels separately. How can I combine and load them into the model using torch.utils.data.DataLoader? I have a dataset that I created; the training data has 20k samples and the labels are also separate. Let's say I want to load a…
Amarnath • 351
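The standard answer here is `TensorDataset`, which pairs sample i with label i so a `DataLoader` can batch them together. A sketch with small stand-in tensors (shapes are illustrative, not the asker's 20k samples):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

x_data = torch.randn(100, 8)            # stand-in for the feature samples
labels = torch.randint(0, 2, (100,))    # stand-in for the separate label array

dataset = TensorDataset(x_data, labels)             # zips sample i with label i
loader = DataLoader(dataset, batch_size=16, shuffle=True)

xb, yb = next(iter(loader))             # one mini-batch of (inputs, labels)
```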
21 votes, 3 answers

Image resizing and padding for CNN

I want to train a CNN for image recognition. The training images do not have a fixed size. I want the input size for the CNN to be 50x100 (height x width), for example. When I resize some small images (for example 32x32) to the input size, the content…
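One approach the answers to this question discuss is to resize while preserving the aspect ratio and pad the remainder with a constant. A NumPy-only sketch (nearest-neighbour indexing stands in for a real image-resizing library; function name and defaults are illustrative):

```python
import numpy as np

def resize_and_pad(img, target_h=50, target_w=100, pad_value=0):
    """Scale a 2-D image to fit inside the target box, then centre-pad."""
    h, w = img.shape
    scale = min(target_h / h, target_w / w)
    new_h = max(1, int(h * scale))
    new_w = max(1, int(w * scale))
    # nearest-neighbour resampling via index arrays
    rows = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    resized = img[np.ix_(rows, cols)]
    # paste the resized image into a constant-valued canvas
    out = np.full((target_h, target_w), pad_value, dtype=img.dtype)
    top = (target_h - new_h) // 2
    left = (target_w - new_w) // 2
    out[top:top + new_h, left:left + new_w] = resized
    return out
```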
16 votes, 2 answers

One Hot Encoding vs Word Embedding - When to choose one or another?

A colleague of mine has an interesting situation: he has quite a large set of possible values for a given categorical feature (roughly 300 distinct values). The usual data science approach would be to perform a one-hot encoding. However, wouldn't…
14 votes, 2 answers

Preprocessing for Text Classification in Transformer Models (BERT variants)

This might be a silly question, but I am wondering whether one should carry out the conventional text preprocessing steps when training one of the transformer models? I remember that for training Word2Vec or GloVe, we needed to perform extensive text cleaning…
TwinPenguins • 4,157
12 votes, 5 answers

Please review my sketch of the Machine Learning process

It's amazingly difficult to find an outline of the end-to-end machine learning process. As a total beginner, this lack of information is frustrating, so I decided to try scraping together my own process by looking at a lot of tutorials that all do…
11 votes, 1 answer

Data preprocessing: Should we normalise images pixel-wise?

Let me present you with a toy example and a reasoning on image normalisation I had: Suppose we have a CNN architecture to classify NxN grayscale images in two categories. Pixel values range from 0 (black) to 255 (white). Class 0: Images that…
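The two schemes the question contrasts are easy to state side by side: global normalisation uses one mean/std for the whole dataset, while pixel-wise normalisation computes a separate mean/std at each pixel position. A sketch on a toy grayscale batch (shapes and the random data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
imgs = rng.integers(0, 256, size=(64, 8, 8)).astype(float)   # toy grayscale batch

# global normalisation: one mean/std for the entire dataset
global_norm = (imgs - imgs.mean()) / imgs.std()

# pixel-wise normalisation: a separate mean/std per pixel position
pixel_norm = (imgs - imgs.mean(axis=0)) / imgs.std(axis=0)
```

After pixel-wise normalisation every pixel position has zero mean across the batch, which is exactly what the global scheme does not guarantee.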
9 votes, 2 answers

Effect of Stop-Word Removal on Transformers for Text Classification

The domain here is essentially topic classification, so not necessarily a problem where stop-words have an impact on the analysis (as opposed to, say, sentiment analysis where structure can affect meaning). With respect to the positional encoding…
9 votes, 1 answer

How to approach the numer.ai competition with anonymous scaled numerical predictors?

Numer.ai has been around for a while now, and there seem to be only a few posts or other discussions about it on the web. The system has changed from time to time, and the set-up today is the following: train (N=96K) and test (N=33K) data with 21…
8 votes, 1 answer

Encoding with OrdinalEncoder : how to give levels as user input?

I am trying to do ordinal encoding using: from sklearn.preprocessing import OrdinalEncoder I will try to explain my problem with a simple dataset. X = pd.DataFrame({'animals':['low','med','low','high','low','high']}) enc =…
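The answer to this one is the `categories` parameter, which takes one ordered list per feature column. A sketch completing the question's own toy dataset:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

X = pd.DataFrame({'animals': ['low', 'med', 'low', 'high', 'low', 'high']})

# pass the desired level order explicitly: one list per feature column
enc = OrdinalEncoder(categories=[['low', 'med', 'high']])
X_enc = enc.fit_transform(X)   # low -> 0.0, med -> 1.0, high -> 2.0
```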
8 votes, 1 answer

sklearn SimpleImputer too slow for categorical data represented as string values

I have a data set with categorical features represented as string values, and I want to fill in the missing values. I've tried sklearn's SimpleImputer, but it takes much more time to complete the task than pandas. Both methods produce…
vlc146543 • 83
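The pandas route the question benchmarks against can be as short as filling each column with its mode (most frequent value). A sketch on a toy frame (column names and data are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', None, 'red', 'blue'],
                   'size':  ['S', 'M', None, 'M']})

# fill each categorical column with its most frequent value,
# equivalent to SimpleImputer(strategy='most_frequent')
filled = df.fillna({col: df[col].mode()[0] for col in df.columns})
```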
8 votes, 5 answers

Issue with OneHotEncoding

So I have a pandas DataFrame with categorical variables in a column that I want to one-hot encode. I've used the following code from an ML Udemy course: from sklearn.preprocessing import…
Iltl • 253
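Worth noting for anyone hitting this: the pattern taught in older course material (the `categorical_features` argument on `OneHotEncoder`) was removed from scikit-learn; the current idiom selects columns with `ColumnTransformer`. A sketch on a toy frame (column names are illustrative):

```python
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'city': ['NY', 'LA', 'NY'], 'temp': [20.0, 25.0, 22.0]})

ct = ColumnTransformer(
    [('onehot', OneHotEncoder(), ['city'])],   # encode only the chosen column
    remainder='passthrough'                    # keep the numeric column unchanged
)
X = ct.fit_transform(df)
X = X.toarray() if sparse.issparse(X) else X   # output may be sparse by default
```

The encoded columns come first (categories sorted: LA, NY), followed by the passed-through numeric column.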