
Suppose we have a set of elements $X = \{x_1, x_2, ..., x_n\}$. Each element is characterised by a set of features. The features characterising a particular element $x_i$ can belong to one of $q$ different categories. Each category $f_q$ can take a value $v_{q_i}$ from a set of possible values $V_q = \{v_{q_1}, v_{q_2}, ...\}$. So, an element $x_i$ may be described as $x_i = \{f_{q_1} = v_{{q_1}_a}, f_{q_1} = v_{{q_1}_b}, ..., f_{q_k} = v_{{q_k}_c}\}$. To make myself extra clear, I will state some properties of the elements $x_i$, along with examples.

For an element $x_i$:

  1. It is possible for a particular feature category $f_{q_i}$ to appear more than once, but with different values. So, for example, $x_i = \{..., f_{q_i} = 1, f_{q_i} = 4, ...\}$ (1 and 4 are example values);
  2. Both $f_{q_i}$ (the id used to describe the category) and its set of values $V_{q_i}$ are categorical variables.
  3. An element $x_i$ may have an arbitrary number of feature categories describing it.
  4. The number of unique pairs of $(f_{q_i}, v_{{q_i}_k})$ appearing in the dataset is around $903$.
  5. The elements of $X$ appear in a set of observations $O = \{o_1, o_2, ..., o_m\}$, whose generic element $o_j$ groups multiple $x_i \in X$ as a sequence, plus a final object $x_j$. The goal of the problem is to infer, for some given observations $O_{-j}$ (i.e. observations whose final element is held out), the last element $x_j$, given the sequence.
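For concreteness, here is a rough sketch (in Python, with made-up category ids and values, since in my case the categories have no interpretable meaning) of how one observation looks:

```python
# Each element is a collection of (category, value) pairs; a category may
# appear more than once with different values. All ids/values are made up.
x1 = [("f3", "1"), ("f3", "4"), ("f7", "B")]   # element with a repeated category
x2 = [("f2", "A"), ("f7", "B")]
x3 = [("f1", "C")]                             # the final element to be inferred

# An observation groups a sequence of elements plus the final element.
o1 = {"sequence": [x1, x2], "final": x3}

# Across the whole dataset there are roughly 903 unique (category, value) pairs.
unique_pairs = {pair for x in (x1, x2, x3) for pair in x}
print(len(unique_pairs))  # 6 in this toy example
```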

How would you convert the categorical features in a meaningful way? I want to underline that my question refers only to how to convert these features within this particular setting, since the approaches I have tried so far cannot handle an element having an arbitrary number of (possibly repeated) feature categories. I stated the full problem only to make clear that target encoding is not a viable option, or is only possible in terms of how many times a particular pair $(f_{q_i}, v_{{q_i}_k})$ appears as belonging to an object in the set of "final elements" $X_J = \{x_{j_1}, x_{j_2}, ..., x_{j_m}\}$. This was, in fact, my initial approach. It is also possible that I have not fully understood target encoding, at this point.
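To show what I mean by my initial approach, here is a rough sketch of counting how often each pair appears among the final elements (pairs flattened into "category=value" tokens; all names are made up):

```python
from collections import Counter

# Final elements of the observations, each a collection of (category, value)
# pairs flattened into "category=value" tokens (made-up data).
final_elements = [
    ["f3=1", "f7=B"],
    ["f3=1"],
    ["f2=A", "f7=B"],
]

# Count how many times each pair appears among the final elements; the count
# (or a normalised version of it) is then used as the pair's encoding.
pair_counts = Counter(pair for x in final_elements for pair in x)
print(pair_counts)  # e.g. Counter({'f3=1': 2, 'f7=B': 2, 'f2=A': 1})
```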

King Powa

1 Answer


One-hot encoding is an option; in this case there will be more than one "hot" bit per element (effectively a multi-hot encoding), but it can still be used.
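For example, a minimal sketch of such a multi-hot encoding, assuming each (category, value) pair is flattened into a single "category=value" token and using scikit-learn's MultiLabelBinarizer (all names are made up):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Each element is the set of its "category=value" tokens; a category may
# appear several times with different values.
elements = [
    ["f3=1", "f3=4", "f7=B"],
    ["f2=A", "f7=B"],
]

# One binary column per unique pair (~903 in the real data), with possibly
# several "hot" bits per row.
mlb = MultiLabelBinarizer()
X = mlb.fit_transform(elements)

print(mlb.classes_)  # ['f2=A' 'f3=1' 'f3=4' 'f7=B']
print(X)             # [[0 1 1 1]
                      #  [1 0 0 1]]
```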

Many numerical encodings which allow combinations are possible. The only restriction regards high dimensionality (which I think you cannot avoid).

E.g. a variation of numerical encoding is possible: if a certain element has more than one value for the same category, then the sum of the individual codes can be used, provided it does not coincide with the numeric code of another value (such a coincidence may or may not be desirable, e.g. when the categories are correlated).
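A minimal sketch of this summing idea; the codes below are made up, and powers of two are chosen so that a sum can never coincide with the code of another single value:

```python
# Made-up codes for the values of one category; powers of two act as bit flags,
# so a sum of codes never equals the code of a single other value.
value_codes = {"1": 1, "4": 2, "A": 4, "B": 8}

def encode_category(values, codes=value_codes):
    """Encode the (possibly multiple) values an element has for one category
    as the sum of their individual codes."""
    return sum(codes[v] for v in values)

# Element with f_q = 1 and f_q = 4 for the same category:
print(encode_category(["1", "4"]))  # 3, distinct from every single-value code
```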

And so on.

Nikos M.
  • I have some doubts. OHE was a possible solution, but we still have a problem regarding high dimensionality. I have thought about numerical encoding, but I would first have to enforce some sort of ordering on my variables (for example, values could be ordered by their frequency), and I have doubts about whether such an approach would maintain the similarity between elements with the same value. For example, if one element has feature a = 1 and a second element has features a = 1 and a = 2, how can I express the similarity between those two items? – King Powa May 26 '22 at 18:52
  • The way the numerical codes can be combined provides a way to model similarity. For example, if the codes are prime numbers and the combination is a product, then sharing common factors is a similarity measure (see the sketch after these comments). – Nikos M. May 26 '22 at 19:01
  • Can you point me to some automatic methods to do that? I could not find any library/resource talking about it. Anyway, I was also considering entity embedding as a possible solution, but I wanted to try other methods first. – King Powa May 26 '22 at 19:29
  • I doubt there is an automatic method that meets your requirements. You would have to do a custom encoding. – Nikos M. May 26 '22 at 19:34
  • For entity embedding (which is a way to numerically encode categories) see https://towardsdatascience.com/entity-embeddings-for-ml-2387eb68e49 – Nikos M. May 26 '22 at 19:40
  • Thank you for the entity embedding pointer, but I was already aware of it. Anyway, there is an intrinsic difficulty related to the fact that, for my problem, I have no domain knowledge, i.e. the meaning of the categories is not provided. So it's hard to find a reliable numeric method. – King Powa May 26 '22 at 20:23
  • Maybe trying to cluster the data based on the categorical features can help provide some intuition in order to proceed further – Nikos M. May 26 '22 at 21:12
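A minimal sketch of the prime-number idea from the comments above; the pair-to-prime assignment and the "number of shared factors" similarity are only one possible choice:

```python
from itertools import count
from math import gcd

def primes():
    """Yield primes by trial division (enough for a small illustration)."""
    found = []
    for n in count(2):
        if all(n % p for p in found):
            found.append(n)
            yield n

# Map every unique (category, value) pair to a distinct prime (made-up pairs).
pairs = [("f2", "A"), ("f3", "1"), ("f3", "2"), ("f7", "B")]
prime_of = dict(zip(pairs, primes()))

def encode(element):
    """Encode an element as the product of the primes of its pairs."""
    code = 1
    for pair in element:
        code *= prime_of[pair]
    return code

def similarity(code_a, code_b):
    """Count the shared (category, value) pairs via the common prime factors."""
    g, shared = gcd(code_a, code_b), 0
    for p in prime_of.values():
        if g % p == 0:
            shared += 1
    return shared

a = encode([("f3", "1")])               # element with a = 1
b = encode([("f3", "1"), ("f3", "2")])  # element with a = 1 and a = 2
print(similarity(a, b))  # 1 shared pair
```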