
I have an XGBoost model and I'm going to retrain it with new features. One column in my data holds the customers' professions and has 60 categories. I suppose there is no need to convert them to dummy variables, since tree-based models can handle categoricals, but many splits would be needed to use all of them, so I decided to keep a subset of the categories and group the rest under a single category. To decide which categories to keep, I applied one-hot encoding to all of them and ran a chi-square test (scipy.stats.chi2_contingency) between each generated dummy column and the target variable. Then I sorted the columns by test statistic in ascending order and picked the first 10. Finally, in the original column, I kept the values that are in this subset and assigned the same category to all others. I'm not sure whether this is a proper method, or whether there is any inconsistency in it. Any suggestions?

My code is as below:

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def get_n_cats(df, col_to_cat, n, target="TARGET_GPL_SATIS"):
    # Apply one-hot encoding
    dummies = pd.get_dummies(df[col_to_cat])

    # Calculate chi-square statistic for each dummy column
    scores = pd.DataFrame(index=dummies.columns, columns=["Score"])
    for col in dummies.columns:
        cont_table = contingency_table(dummies[col], df[target])
        score = chi2_contingency(cont_table)[0]
        scores.loc[col,"Score"] = score

    # Sort by score and get first n columns
    scores.sort_values(by="Score", ascending=True, inplace=True)
    first_n_columns = scores.index[:n]

    # Create mapping dict
    mapping = create_mapping(first_n_columns)

    # Preserve only first n columns as categories
    return [mapping[val] if val in mapping else mapping["other"] for val in df[col_to_cat]]

def contingency_table(c1, c2):
    """Calculates contingency table between provided columns."""
    df = pd.DataFrame({"c1":c1, "c2":c2})
    return df.groupby(['c1','c2']).size().unstack(fill_value=0).values

def create_mapping(first_n_columns):
    """Creates mapping dict."""
    mapping = {cat:code+1 for code, cat in enumerate(first_n_columns)}
    mapping[np.nan] = 0
    mapping["other"] = len(first_n_columns) + 1
    return mapping
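For context, the same selection pipeline can be written compactly with pd.crosstab. This is a toy sketch with made-up column names and synthetic data; note that it sorts descending, since a larger chi-square statistic indicates a stronger association with the target:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data (column names are illustrative): 4 professions, binary target
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "PROFESSION": rng.choice(["engineer", "teacher", "nurse", "clerk"], size=200),
    "TARGET": rng.integers(0, 2, size=200),
})

# Chi-square statistic of each dummy column against the target
dummies = pd.get_dummies(df["PROFESSION"])
scores = {
    col: chi2_contingency(pd.crosstab(dummies[col], df["TARGET"]))[0]
    for col in dummies.columns
}

# Keep the n categories with the LARGEST statistic (strongest association)
n = 2
keep = sorted(scores, key=scores.get, reverse=True)[:n]
df["PROFESSION_CAT"] = df["PROFESSION"].where(df["PROFESSION"].isin(keep), "other")
print(df["PROFESSION_CAT"].nunique())  # 3 distinct labels: 2 kept + "other"
```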
tkarahan
  • So, I would say that what you did is not wrong. Nevertheless, what is usually done to reduce the number of levels in a categorical variable is simply to count the frequency of each level, keep the most frequent ones, and replace the others with another label. I guess you could try both and see which method turns out to be more effective in your particular case. – Edoardo Guerriero Apr 03 '20 at 13:47
  • By the way, I suppose using only the statistic returned from the function is wrong. I did some searching and saw that, after the statistic is obtained, people also calculate a critical value from the chi-square distribution and compare the statistic with it. If the returned statistic is below the critical value, they conclude there is no relationship between the categorical variables; if it is above, they conclude there is a relationship. Now I wonder whether the difference between the statistic and the critical value is also important? – tkarahan Apr 05 '20 at 09:47
  • I asked the same things I commented above on the Cross Validated Stack Exchange and got an informative reply. Here is the link: https://stats.stackexchange.com/questions/458574/chi-square-test-confidence The main theme is: use p-values instead of critical values to assess independence, because they are always on the same scale and not dependent on the experiment size. – tkarahan Apr 06 '20 at 07:08
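The point in the linked answer can be illustrated with a quick sketch: the p-value returned by chi2_contingency lies on a fixed 0-to-1 scale, so it can be compared directly across dummy columns, unlike the raw statistic. The data below is synthetic:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(1)
target = rng.integers(0, 2, size=500)

# Synthetic dummies: one dependent on the target, one independent of it
dummies = {
    "related": (target == 1) & (rng.random(500) < 0.8),
    "unrelated": rng.random(500) < 0.5,
}

pvals = {}
for name, dummy in dummies.items():
    stat, p, dof, expected = chi2_contingency(pd.crosstab(dummy, target))
    pvals[name] = p

print({k: round(v, 4) for k, v in pvals.items()})  # small p -> evidence of dependence
```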

0 Answers