Can someone explain why, in many machine learning competitions, participants compute statistical aggregations such as the mean and median of numerical features?
How can this improve the performance of a machine learning model, and how are these aggregations implemented?
Example:
# We also created statistical 'max_mean' and 'range' features, which noticeably improved the score:
def new_features(df):
    for col in agg_stat_columns:
        df[col + '_range'] = df[col + '_max'] - df[col + '_min']
        df[col + '_max_mean'] = df[col + '_max'] / df[col + '_mean']
        df[col + '_median'] = df[col].median()
    return df
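
As far as I can tell, the `_max`, `_min` and `_mean` columns this function relies on come from grouping the raw rows by some ID and aggregating each numerical column. Below is a minimal sketch of that step as I understand it, assuming pandas; the `customer_id` key, the `amount` column and the toy values are hypothetical and not from the competition code:

```python
import pandas as pd

# Hypothetical toy data: several rows per customer (names are illustrative only).
raw = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "amount": [10.0, 20.0, 30.0, 5.0, 15.0],
})

agg_stat_columns = ["amount"]

# One row per customer, with per-group statistics for each numerical column.
agg = raw.groupby("customer_id")[agg_stat_columns].agg(["min", "max", "mean", "median"])

# Flatten the two-level column index: ("amount", "max") -> "amount_max".
agg.columns = ["_".join(pair) for pair in agg.columns]
agg = agg.reset_index()

# Derived features in the style of the snippet above.
for col in agg_stat_columns:
    agg[col + "_range"] = agg[col + "_max"] - agg[col + "_min"]
    agg[col + "_max_mean"] = agg[col + "_max"] / agg[col + "_mean"]

print(agg)
```

In this sketch the median is computed per group, whereas `df[col].median()` in the snippet above would assign one global value to every row. I am not sure which of the two was intended.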
Thanks,