Understanding the purpose of embeddings and their sizes in PyTorch

Question

I'm very new to PyTorch and am taking a course on Udemy. There is something I find hard to understand and would like an explanation of, in simpler words than what I can find in the documentation.

One of the lessons is about solving a regression problem with an ANN in PyTorch.
The original dataframe contains categorical and continuous data.

We create embedding sizes only for the categorical columns, and also use a size parameter that is calculated with a formula I don't understand:

# set embedding sizes
cat_cols = ['sex', 'marital-status', 'education']
cat_szs = [len(df[col].cat.categories) for col in cat_cols]

# why 50???
emb_szs = [(size, min(50, (size + 1) // 2)) for size in cat_szs]
emb_szs

I don't understand the following:

1. cat_szs is the number of unique values (categories) in each column. What is the embedding size calculation, and why do we need it?
2. How should I interpret the embedding size values? The output is [(14, 7), (2, 1), (6, 3)]. I don't understand what the 7, 1 and 3 stand for. I know they come from the formula, but what do they mean?
3. Why is an embedding needed? I understand it is a way to represent categorical data and is always needed for categorical data. Am I correct?

score 1 · Answer 1 · answered Jul 04 '22 at 15:32

To answer your question.

First, think of only one column. Let's say your column has 10 unique values, i.e. 10 categories. The entry in cat_szs for that column will then be 10.

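Then emb_szs is calculated with emb_szs=[(size, min(50, (size+1)//2)) for size in cat_szs]. The embedding dimension is half the number of categories, rounded up, and min(50, ...) caps it at 50 so that columns with very many categories don't get very large embeddings.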
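As a quick check of the arithmetic, here is a minimal sketch that applies the same rule to the category counts from the question (14, 2 and 6) and to one large column. The helper name embedding_sizes and the max_dim parameter are made up for illustration; they are not part of the course code.

```python
def embedding_sizes(cat_szs, max_dim=50):
    """Return (num_categories, embedding_dim) pairs.

    embedding_dim is half the number of categories, rounded up,
    and never larger than max_dim.
    """
    return [(size, min(max_dim, (size + 1) // 2)) for size in cat_szs]


# Category counts from the question: sex, marital-status, education
emb_szs = embedding_sizes([14, 2, 6])
print(emb_szs)  # [(14, 7), (2, 1), (6, 3)]

# The cap only matters for columns with more than 100 categories
capped = embedding_sizes([10, 1000])
print(capped)  # [(10, 5), (1000, 50)]
```

In each pair, the first number is how many categories the column has and the second is how many numbers are used to represent each category.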

Using the formula, your embedding layer will have shape (10, 5): for each of the 10 categories you have a 5-dimensional vector representing it. With label encoding you would represent the categories as 1, ..., 10. With an embedding, category 1 would instead be represented by, e.g., [0.2, 0.3, 0.1, 0.4, 0.6]. Similarly, every other category gets its own 5-dimensional representation.
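The (10, 5) shape can be checked directly with torch.nn.Embedding. This is only a small sketch, assuming the column has already been label-encoded; note that nn.Embedding expects 0-based integer codes (0 to 9 here), not 1 to 10. The variable names are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # only so the random initial weights are reproducible

# 10 categories, each represented by a 5-dimensional vector
emb = nn.Embedding(num_embeddings=10, embedding_dim=5)

# Three rows of a label-encoded column (codes must lie in 0..9)
codes = torch.tensor([0, 3, 9])
vectors = emb(codes)

print(emb.weight.shape)  # torch.Size([10, 5]): one row per category
print(vectors.shape)     # torch.Size([3, 5]): one 5-d vector per input row
print(vectors[1])        # same values as emb.weight[3]
```

The layer is just a lookup table: passing code 3 returns row 3 of the weight matrix. The values start out random and are updated during training like any other weights.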
An embedding is a way to represent categorical data, but it is not always needed. You can also use label encoding, one-hot encoding, or other encodings. The advantage is that each category gets a learned multi-dimensional vector rather than a single integer, so the model can place similar categories close together and capture relationships between them that a single label can't express.
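To make the comparison between encodings concrete, here is a small sketch that encodes the same three label codes as one-hot vectors and as embeddings. It assumes a column with 10 categories, as in the example above; the variable names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

codes = torch.tensor([0, 3, 9])  # label encoding: one integer per row

# One-hot encoding: fixed, sparse, one column per category
one_hot = F.one_hot(codes, num_classes=10).float()

# Embedding: trainable, dense, fewer columns than categories
emb = nn.Embedding(num_embeddings=10, embedding_dim=5)
dense = emb(codes)

print(one_hot.shape)  # torch.Size([3, 10])
print(dense.shape)    # torch.Size([3, 5])

# An embedding lookup gives the same result as multiplying the
# one-hot vectors by the embedding weight matrix
print(torch.allclose(one_hot @ emb.weight, dense))  # True

# The embedding vectors are parameters that training can adjust
print(emb.weight.requires_grad)  # True
```

The one-hot vectors never change, while the embedding rows are learned together with the rest of the network, which is where the extra expressiveness comes from.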

You can look at below links to understand embeddings.
