Can someone explain why many participants in machine learning competitions compute statistical aggregations, such as the mean and median, of numerical features?

How can this improve the performance of a machine learning model, and how are these statistical aggregations implemented?

Example:

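    # Also we created statistical 'max_mean' and 'range' features which noticeably improved score:
    # (agg_stat_columns is a list of base column names defined elsewhere)
    def new_features(df):
        for col in agg_stat_columns:
            # Spread between the largest and smallest aggregated value
            df[col + '_range'] = df[col + '_max'] - df[col + '_min']
            # Ratio of the maximum to the mean
            df[col + '_max_mean'] = df[col + '_max'] / df[col + '_mean']
            # Median of the raw column, broadcast to every row
            df[col + '_median'] = df[col].median()
        return df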
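For context, the snippet above assumes that `_max`, `_min` and `_mean` columns already exist. My understanding is that they usually come from a group-wise aggregation, roughly as in the sketch below. The toy data, the `customer_id` and `amount` column names, and passing the column list as an argument are my own assumptions for illustration, not taken from the competition code.

```python
import pandas as pd

# Hypothetical toy data: several transactions per customer.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "amount": [10.0, 20.0, 60.0, 5.0, 15.0],
})

# Assumed list of numeric columns to aggregate (name borrowed from the snippet above).
agg_stat_columns = ["amount"]

# One row per customer with the min/max/mean/median of each numeric column.
stats = transactions.groupby("customer_id")[agg_stat_columns].agg(
    ["min", "max", "mean", "median"]
)
# Flatten the two-level column index: ("amount", "max") -> "amount_max".
stats.columns = [f"{col}_{stat}" for col, stat in stats.columns]
stats = stats.reset_index()


def new_features(df, columns):
    """Derive 'range' and 'max_mean' features from the aggregated columns."""
    df = df.copy()  # avoid modifying the caller's DataFrame
    for col in columns:
        df[col + "_range"] = df[col + "_max"] - df[col + "_min"]
        df[col + "_max_mean"] = df[col + "_max"] / df[col + "_mean"]
    return df


features = new_features(stats, agg_stat_columns)
print(features)
```

Here each customer ends up as a single row, so a model that expects one row per entity can still see a summary of that entity's many raw records.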

Thanks,

  • Aggregation techniques like [bagging](http://stanfordphd.com/Bagging.html) allow for variance reduction. – stans Aug 09 '23 at 05:26
  • Hi @Mohamed7894, welcome to the site. Can you please provide some links with examples of the aggregations you mention? – noe Aug 09 '23 at 11:21
  • No, I don't mean bagging, it's feature engineering. I've included an example in the question. – Mohamed7894 Aug 09 '23 at 12:46
