
Let's say I put the following two datasets in the best possible model (same model for both):

  • A raw dataset, with the variables just as they came from the query.
  • A feature-engineered dataset, with hundreds of created variables, which came from the same raw dataset I just mentioned.

Could the difference between the two AUCs be large? By how much?

  • Any ground-rules here, on what "raw vs feature-engineered" and "best possible model" can mean? – Ben Reiniger Jan 17 '20 at 21:58
  • Yes. Raw: the variables have missing values, no grouping variables are derived (mean by group or similar), and no summations (A+B, A-B), ratios (A/B), or similar are calculated. Feature-engineered: mean encoding, frequency encoding, impact encoding, binning into ranges, ranks, lagged variables, and new variables derived from clustering (a short sketch of two of these encodings follows below). Best model: let's say XGBoost. – Juan Esteban de la Calle Jan 17 '20 at 22:07
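
For concreteness, here is a minimal sketch of two of the encodings named in that comment (frequency encoding and mean/impact encoding). The column names city and y are hypothetical, and in practice target encoding should be computed out-of-fold to avoid leakage:

    # Minimal illustrative sketch (hypothetical column names): frequency encoding
    # and mean/impact encoding of a categorical column "city" against target "y".
    import pandas as pd

    df = pd.DataFrame({
        "city": ["A", "A", "B", "C", "B", "A"],
        "y":    [1, 0, 1, 0, 1, 1],
    })

    # Frequency encoding: replace each category by its relative frequency.
    df["city_freq"] = df["city"].map(df["city"].value_counts(normalize=True))

    # Mean (impact/target) encoding: replace each category by the mean of y within it.
    # In practice this should be computed out-of-fold to avoid target leakage.
    df["city_mean_y"] = df["city"].map(df.groupby("city")["y"].mean())

    print(df)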

2 Answers


Yes, the performance can vary a lot using feature engineering.

Example: suppose a dataset where the response variable $y$ is true if and only if $x$ is odd.

x    y
346  F
13   T
178  F
64   F
987  T
...

Most learning models will fail to identify this pattern from the raw value and will perform poorly, usually falling back to always predicting the majority class. However, simply adding a feature $x \% 2$ to the data will allow any model to perform perfectly.

Of course this is a toy example, but the point is that a single well-chosen feature can drastically change performance. Naturally, the size of the increase depends entirely on the data and on the nature of the features added.
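
As a rough illustration of that point, here is a minimal sketch (the data generation and the choice of a shallow decision tree are assumptions for illustration only): the model scores near chance on the raw integer alone, but perfectly once the parity feature $x \% 2$ is added.

    # Minimal sketch: AUC with and without an engineered parity feature.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    x = rng.integers(0, 1000, size=5000)
    y = x % 2  # response is 1 exactly when x is odd

    X_raw = x.reshape(-1, 1)             # raw feature only
    X_eng = np.column_stack([x, x % 2])  # raw feature plus engineered parity feature

    for name, X in [("raw", X_raw), ("engineered", X_eng)]:
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
        model = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)
        auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
        print(f"{name}: AUC = {auc:.3f}")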

Erwan

I would say that the best possible model for the raw data would derive all the meaningful features that you would have created from the data anyway.

And I would say that the best possible model for the feature-engineered dataset would remove or ignore unnecessary features.

The best possible model would have an AUC of 1 anyway, since it makes all predictions correctly.

But even in the presence of noise, where an AUC of 1 cannot be achieved, I think the argument holds.

But the speed of learning/convergence may vary.

Pieter21