Consider the very basic example below:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression

X = data.drop("Price", axis=1)
y = data["Price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

scaler = MinMaxScaler()
model = LinearRegression()

# Fit the scaler on the training inputs only, then apply it to both splits
scaler.fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

model.fit(X_train_s, y_train)
model.score(X_test_s, y_test)
The code above splits the data into inputs and outputs and into training and test sets, scales the inputs using the scaler object, trains a LinearRegression model on the scaled training data, and then scores the model on the scaled test data.
Now suppose I am satisfied with the results, so I want to serialize the model to a joblib file. But any data that goes into the model later has to be scaled first, right? So, should I do something like this?
import joblib

joblib.dump((scaler, model), "scaler_and_model.joblib")
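If that is the right idea, then I assume loading and using it later would look something like the sketch below, where new_data is a hypothetical DataFrame with the same columns as X_train:

import joblib

# Load the (scaler, model) tuple saved above
scaler, model = joblib.load("scaler_and_model.joblib")

# new_data is a hypothetical DataFrame with the same columns as X_train;
# it must go through the same fitted scaler before prediction
predictions = model.predict(scaler.transform(new_data))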
Similarly, some models might use encoders like OrdinalEncoder() or OneHotEncoder(). And the fit() call is supposed to "update" the scaler/encoder with state learned from your particular data (e.g., column minima and maxima, or the set of categories), right?
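For example (a toy illustration with made-up numbers, just to check my understanding), fit() stores the learned state as attributes on the object:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

scaler = MinMaxScaler().fit(np.array([[1.0], [5.0], [9.0]]))
print(scaler.data_min_, scaler.data_max_)  # [1.] [9.] -- learned from the data

enc = OrdinalEncoder().fit(np.array([["red"], ["blue"], ["green"]]))
print(enc.categories_)  # learned categories (alphabetical by default)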
So, I have a hunch that you need to save the fitted scaler (or encoder) to be able to use your trained model properly later on. What's the right thing to do in this situation?
What's the general strategy for deploying a trained model that used an encoder or scaler during training?
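One pattern I've seen suggested, sketched below under the same assumptions as my code above, is to wrap the preprocessing step and the model in a single Pipeline, so that fitting, scoring, and serialization all go through one object. Is this the recommended approach?

import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression

# Bundle the scaler and the model so one fit()/predict() handles both steps
pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("model", LinearRegression()),
])
pipe.fit(X_train, y_train)   # fits the scaler, then the model
pipe.score(X_test, y_test)   # scales X_test internally before scoring

# A single artifact now carries both the preprocessing state and the model
joblib.dump(pipe, "pipeline.joblib")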