Best way to scale across different datasets

Question

I have come across a peculiar situation when preprocessing data.

Let's say I have a dataset A. I split the dataset into A_train and A_test. I fit the A_train using any of the given scalers (sci-kit learn) and transform A_test with that scaler. Now training the neural network with A_train and validating on A_test works well. No overfitting and performance is good.

Let's say I have dataset B with the same features as in A, but with different ranges of values for the features. A simple example of A and B could be Boston and Paris housing datasets respectively. To test the performance of the above trained model on B, we transform B according to scaling attributes of A_train and then validate. This usually degrades performance, as this model is never shown the data from B.

The peculiar thing is if I fit and transform on B directly instead of using scaling attributes of A_train, the performance is a lot better. Usually, this reduces performance if I test this on A_test. In this scenario, it seems to work, although it's not right.

Since I work mostly on climate datasets, training on every dataset is not feasible. Therefore I would like to know the best way to scale such different datasets with the same features to get better performance.

Any ideas, please.

PS: I know training my model with more data can improve performance, but I am more interested in the right way of scaling. I tried removing outliers from datasets and applied QuantileTransformer, it improved performance but could be better.

score 1 · Answer 1 · answered May 03 '19 at 14:10

Your observation stems from two facts

Model performance deteriorates when data distribution changes (shifts). For example, when model is trained on Boston dataset but tested on Paris dataset, or trained on summer dataset but tested on winter dataset.
Model evaluation (wrongly) shows an improvement when information leaks from test set. For example, when test set is scaled with must-not-be-known statistics of the test set like (min, mean, std, max).

If a data set for other cities like Paris is not available at the training time (1), obviously there is nothing we can do. In real world cases, when we see a noticeable difference between validation score and test score, we hypothesize that the test set has a different distribution than the training set, thus, we try to incorporate the new data and re-train (probably not a full re-training in an online learning setting). Otherwise (2), we can

Combine all the cities into one dataset, then split it into train and test sets (later, validation set will be peaked from train set), i.e.
```
  train = A_train U B_train U ...
  test = A_test U B_test U ...
```
This way, the information from all cities will be used for scaling. We may under- or over-sample some cities to have a data set with equal representatives from each city.
Or, build a separate (specialized) model for each city (or a group of similar cites) which bypasses the "train on A_train, and test on B" problem all together.

This was my plan if I have to deploy my model. Releasing several versions by updating the training data and retrain. — raghu, May 04 '19 at 16:45

score 0 · Answer 2 · answered May 03 '19 at 14:40

0

Do you fit, transform, and evaluate on all of B, or do you split B into B_train and B_test, then scale based on B_train and evaluate on B_test?

There are plenty of options to try, but the best way to handle this is unfortunately going to depend on your data and the particular problem at hand.

answered May 03 '19 at 14:40

Matthew

1,254
7
12

I tried those options. The best results I got are with fitting on B_train and transform B_test and use A_train model on it. I guess there should be a better way to do it. – raghu May 04 '19 at 16:41

Best way to scale across different datasets

2 Answers2