Consider the very basic example below:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression

X = data.drop("Price", axis=1)
y = data["Price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

scaler = MinMaxScaler()
model = LinearRegression()

# Fit the scaler on the training inputs only, then apply it to both splits
scaler.fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

model.fit(X_train_s, y_train)
model.score(X_test_s, y_test)
The code above splits the data into inputs and outputs and into training and test sets, scales the inputs using the scaler object, trains a LinearRegression model on the scaled training data, and then scores the model on the scaled test data.
Now suppose I am satisfied with the results, so I want to serialize the model to a joblib file. But any data that goes into the model later has to be scaled first, right? So, should I do something like this?
import joblib

joblib.dump((scaler, model), "scaler_and_model.joblib")
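If that is the right idea, then I assume loading and using it later would look something like the sketch below, where new_data is a hypothetical DataFrame with the same columns as X_train:

import joblib

# Load the (scaler, model) tuple saved above
scaler, model = joblib.load("scaler_and_model.joblib")

# new_data is a hypothetical DataFrame with the same columns as X_train;
# it must go through the same fitted scaler before prediction
predictions = model.predict(scaler.transform(new_data))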
Similarly, some models might use encoders like OrdinalEncoder() or OneHotEncoder(). And the fit() call is supposed to "update" the scaler/encoder with state learned from your particular data (e.g., column minima and maxima, or the set of categories), right?
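For example (a toy illustration with made-up numbers, just to check my understanding), fit() stores the learned state as attributes on the object:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

scaler = MinMaxScaler().fit(np.array([[1.0], [5.0], [9.0]]))
print(scaler.data_min_, scaler.data_max_)  # [1.] [9.] -- learned from the data

enc = OrdinalEncoder().fit(np.array([["red"], ["blue"], ["green"]]))
print(enc.categories_)  # learned categories (alphabetical by default)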
So, I have a hunch that you need to save the fitted scaler (or encoder) to be able to use your trained model properly later on. What's the right thing to do in this situation?
What's the general strategy for deploying a trained model that used an encoder or scaler during training?
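One pattern I've seen suggested, sketched below under the same assumptions as my code above, is to wrap the preprocessing step and the model in a single Pipeline, so that fitting, scoring, and serialization all go through one object. Is this the recommended approach?

import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression

# Bundle the scaler and the model so one fit()/predict() handles both steps
pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("model", LinearRegression()),
])
pipe.fit(X_train, y_train)   # fits the scaler, then the model
pipe.score(X_test, y_test)   # scales X_test internally before scoring

# A single artifact now carries both the preprocessing state and the model
joblib.dump(pipe, "pipeline.joblib")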