This topic derives from an earlier one: https://community.rstudio.com/t/39722 (optional reading).
I have the following dataset, myds:
str(myds)
'data.frame': 841500 obs. of 30 variables:
$ score : num 0 0 0 0 0 0 0 0 0 0 ...
$ amount_sms_received : int 0 0 0 0 0 0 3 0 0 3 ...
$ amount_emails_received : int 3 36 3 12 0 63 9 6 6 3 ...
$ distance_from_server : int 17 17 7 7 7 14 10 7 34 10 ...
$ age : int 17 44 16 16 30 29 26 18 19 43 ...
$ points_earned : int 929 655 286 357 571 833 476 414 726 857 ...
$ registrationYYYY : Factor w/ 2 levels ...
$ registrationDateMM : Factor w/ 9 levels ...
$ registrationDateDD : Factor w/ 31 levels ...
$ registrationDateHH : Factor w/ 24 levels ...
$ registrationDateWeekDay : Factor w/ 7 levels ...
$ catVar_05 : Factor w/ 2 levels ...
$ catVar_06 : Factor w/ 140 levels ...
$ catVar_07 : Factor w/ 21 levels ...
$ catVar_08 : Factor w/ 1582 levels ...
$ catVar_09 : Factor w/ 70 levels ...
$ catVar_10 : Factor w/ 755 levels ...
$ catVar_11 : Factor w/ 23 levels ...
$ catVar_12 : Factor w/ 129 levels ...
$ catVar_13 : Factor w/ 15 levels ...
$ city : Factor w/ 22750 levels ...
$ state : Factor w/ 55 levels ...
$ zip : Factor w/ 26659 levels ...
$ catVar_17 : Factor w/ 2 levels ...
$ catVar_18 : Factor w/ 2 levels ...
$ catVar_19 : Factor w/ 3 levels ...
$ catVar_20 : Factor w/ 6 levels ...
$ catVar_21 : Factor w/ 2 levels ...
$ catVar_22 : Factor w/ 4 levels ...
$ catVar_23 : Factor w/ 5 levels ...
Now let's do an experiment: take the discrete variable catVar_08 and count, for each of its values, how many observations carry that value. The table below is sorted in decreasing order by the number of observations.
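The helper get_discrete_category_counts used below is custom and its definition isn't shown in the post; a minimal sketch of it, assuming dplyr (an assumption based on the tibble output), could be:
library(dplyr)

get_discrete_category_counts = function(ds, varname) {
  # Count how many observations carry each level of the given column,
  # sorted with the most frequent levels first.
  ds %>%
    group_by(.data[[varname]]) %>%
    summarise(count = n()) %>%
    arrange(desc(count))
}
With that helper in place: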
varname = "catVar_08"
counts = get_discrete_category_counts(myds, varname)
counts
## # A tibble: 1,571 x 2
## # Groups: catVar_08 [1,571]
## catVar_08 count
## <chr> <int>
## 1 catVar_08_value_415 83537
## 2 catVar_08_value_463 68244
## 3 catVar_08_value_179 65414
## 4 catVar_08_value_526 59172
## 5 catVar_08_value_195 49275
## 6 catVar_08_value_938 26834
## 7 catVar_08_value_1142 25351
## 8 catVar_08_value_1323 23794
## 9 catVar_08_value_1253 18715
## 10 catVar_08_value_1268 18379
## # ... with 1,561 more rows
Let's examine these counts more closely by plotting them over different rank ranges:
plot_discrete_category_counts(myds, varname, 1, 500)
plot_discrete_category_counts(myds, varname, 501, 1000)
plot_discrete_category_counts(myds, varname, 1001)
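plot_discrete_category_counts is also a custom helper whose definition isn't shown; a sketch of the idea, assuming ggplot2 (a bar chart of the sorted counts restricted to a range of ranks), could be:
library(ggplot2)

plot_discrete_category_counts = function(ds, varname, from, to = NULL) {
  counts = get_discrete_category_counts(ds, varname)
  if (is.null(to)) to = nrow(counts)  # the third call above omits the upper bound
  slice_df = counts[from:to, ]
  slice_df$rank = from:to  # rank 1 = most frequent category
  ggplot(slice_df, aes(x = rank, y = count)) +
    geom_col() +
    labs(x = paste(varname, "category rank (by count)"), y = "observations")
}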
As the second plot shows, if we keep the 500 categories with the highest observation counts and replace the remaining 1571 - 500 = 1071 with a generic value like Not Specified, then every category we keep has been observed more than roughly 40 times.
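That cutoff can be read directly from the sorted counts tibble computed above:
counts$count[500]  # observation count of the 500th most frequent category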
Another big advantage of such level reduction is that training the model becomes considerably cheaper computationally, which matters here: bear in mind that this dataset has 841500 observations.
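The replacement step itself isn't shown in the post; one way to do it, assuming the forcats package (fct_lump_n keeps the n most frequent levels and lumps the rest into other_level), would be:
library(forcats)

# Keep the 500 most frequent levels of catVar_08; the remaining ~1071
# all become the single generic level "Not Specified".
myds$catVar_08 = fct_lump_n(myds$catVar_08, n = 500,
                            other_level = "Not Specified")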
Let's run the same experiment with another discrete variable, zip:
varname = "zip"
counts = get_discrete_category_counts(myds, varname)
counts
## # A tibble: 26,458 x 2
## # Groups: zip [26,458]
## zip count
## <chr> <int>
## 1 zip_value_18847 428
## 2 zip_value_18895 425
## 3 zip_value_25102 425
## 4 zip_value_2986 422
## 5 zip_value_1842 414
## 6 zip_value_25718 410
## 7 zip_value_3371 397
## 8 zip_value_11638 395
## 9 zip_value_4761 394
## 10 zip_value_6746 391
## # ... with 26,448 more rows
Again, let's examine the counts more closely with plots over different rank ranges:
plot_discrete_category_counts(myds, varname, 1, 1000)
plot_discrete_category_counts(myds, varname, 1001, 2500)
plot_discrete_category_counts(myds, varname, 2501)
In this case, if we keep the 2500 categories with the highest observation counts and replace the remaining 26458 - 2500 = 23958 with a generic value like Not Specified, then every category we keep has been observed more than roughly 100 times.
As before, this level reduction also makes training the model considerably cheaper computationally; again, this dataset has 841500 observations.
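The same reduction sketch applies here, e.g. (again assuming forcats::fct_lump_n as above):
myds$zip = fct_lump_n(myds$zip, n = 2500, other_level = "Not Specified")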
Highlight:
In the second case we were analyzing the zip code, a very common kind of discrete variable, so I would say this is a common use case.
My Question:
What would be a good threshold number or formula X such that, for a given discrete variable, categories/values seen fewer than X times in the dataset are worth removing?
For example, if catVar_08 above has some categories/values that have been seen only 5 times, we should probably not take them into consideration and should replace them with Not Specified, because the machine learning model will most likely not be able to learn anything about categories observed so few times.
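Whatever X turns out to be, applying it mechanically is straightforward; for instance, forcats::fct_lump_min (again an assumption, not from the original post) lumps every level seen fewer than min times:
library(forcats)

x = 5  # hypothetical threshold; deriving it is exactly the open question
myds$catVar_08 = fct_lump_min(myds$catVar_08, min = x,
                              other_level = "Not Specified")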
One possible signature for the formula would be:
num_rows_dataset = <available above>
num_cols_dataset = <available above>
discrete_category_counts = <available above>
arg_4 = ...
arg_5 = ...
...
get_threshold_category_observations = function(
num_rows_dataset,
num_cols_dataset,
discrete_category_counts,
arg_4,
arg_5,
...
) {
threshold = ...
return (threshold)
}
Any idea/recommendation about this?